Preface


This book is a general introduction to the statistical analysis of networks, and can 
serve both as a research monograph and as a textbook. Many fundamental modern 
tools and concepts needed for the analysis of networks are presented, such as net- 
work modeling, community detection, graph-based semi-supervised learning and 
sampling in networks. The description of these concepts is self-contained, with 
both theoretical justifications and applications provided for the presented algo- 
rithms. 

Researchers, including postgraduate students, working in the area of network 
science, complex network analysis, or social network analysis, will find up-to-date 
statistical methods relevant to their research tasks. This book can also serve as text- 
book material for courses related to the statistical approach to the analysis of com- 
plex networks. 

In general, the chapters are fairly independent and self-supporting, and the book 
could be used for course composition “à la carte”. Nevertheless, Chapter 2 is needed
to a certain degree for all parts of the book. It is also useful to read Chapter 4 before 
reading Chapters 5 and 6, but this is not absolutely necessary. Reading Chapter 3 
can also be helpful before reading Chapters 5 and 7. 

As prerequisites for reading our book, we expect basic knowledge of probability,
linear algebra and elementary notions of graph theory. We have also added appendices
describing some required notions from these disciplines.


DOI: 10.1561/9781638280514.ch1


Chapter 1 


Introduction 


A network is a collection of objects interacting with each other. Networks are 
found in numerous scientific disciplines: atoms or interacting particles in statisti- 
cal physics, protein interactions in molecular biology, social networks in sociology 
and the Internet web-graph in computer science, just to name a few. Several types 
of interactions exist. While binary interactions are the simplest (did Alice interact 
with Bob today?), weighted interactions (the number of interactions between Alice 
and Bob today) or temporal interactions (at what precise times did Alice and Bob 
interact?) provide additional valuable information. 

Networks with binary interactions are conveniently represented by a graph. 
A graph G is a pair (V, E), where V is the set of objects (also called nodes or vertices),
and E is the set of interacting node pairs (also called edges or links). This standard
graph representation can be extended to weighted networks or temporal networks 
by considering weighted edges or temporal sequences of edges. In the first and sec- 
ond parts of the introductory section, we present several examples of real-world 
networks and describe unifying properties. 
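To make the three kinds of interactions concrete, here is a minimal Python sketch of how binary, weighted and temporal networks might be stored. The node names, the toy data and the helper names (`neighbours`, `binary_edges_from_events`) are illustrative assumptions, not notation used in the book.

```python
# Binary interactions: a graph G = (V, E) with undirected edges stored as frozensets.
V = {"Alice", "Bob", "Carol"}
E = {frozenset({"Alice", "Bob"}), frozenset({"Bob", "Carol"})}

# Weighted interactions: each edge carries the number of interactions (toy values).
weights = {frozenset({"Alice", "Bob"}): 3, frozenset({"Bob", "Carol"}): 1}

# Temporal interactions: a time-ordered sequence of (time, node, node) events.
events = [(1, "Alice", "Bob"), (2, "Bob", "Carol"), (4, "Alice", "Bob"), (7, "Alice", "Bob")]


def neighbours(node, edges):
    """Return the nodes that share an edge with `node`."""
    return {other for edge in edges if node in edge for other in edge if other != node}


def binary_edges_from_events(events):
    """Forget times and multiplicities: keep only which pairs interacted."""
    return {frozenset({u, v}) for _, u, v in events}
```

Here the binary graph is recovered from the temporal sequence by discarding the time stamps, which illustrates why the weighted and temporal representations carry strictly more information.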




1.1 Examples of Networks 


Let us present several examples of real-world networks. Although, for clarity of 
exposition, we categorise the networks by types, this classification is subjective, and 
a network could belong to two or more types. 


Social networks 


One of the first examples of social networks is the Zachary karate club, representing
the friendships between the 34 members of a karate club (see Figure 1.1). During
the two-year study (Zachary, 1977), the club members split into two groups
after a feud occurred between the main instructor and the club's president. This 
dispute makes the dataset extremely popular in the network science community. 
We would like to answer the intriguing question: can one predict the resulting two 
groups based only on the friendship graph? This lays the ground for the problem 
of community detection, which we will discuss in detail in Chapter 4. 

Data concerning social networks of real-life social relationships (acquaintances, 
interactions) are notoriously hard to gather. Indeed, questionnaires are physical and 
take time to analyse, making the collection from a large number of individuals dif- 
ficult. Moreover, they are prone to human error and personal interpretation. For- 
tunately, it is much easier to gather examples of datasets in online social networks. 


Figure 1.1. Karate club.

Figure 1.2. Two largest communities of the LiveJournal network.


Figure 1.3. Political Blogs network. 


Thus, it is not surprising that most examples of datasets of large social networks 
come from online social networks or web-blogs. 

One such example is the LiveJournal dataset. LiveJournal is an online blogging 
community in which users can befriend each other. The users are also free to create 
groups which other users can join. These groups can be considered as ground-truth 
communities. Figure 1.2 shows the LiveJournal friendship network restricted to the 
two largest communities. 

Adamic and Glance (2005) studied the linking patterns of political bloggers during
the U.S. Presidential Election of 2004. They considered 1494 blogs in total,
759 liberal and 735 conservative, and constructed the interactions by identifying
whether one blog references another blog. As shown in Figure 1.3, the difference
between the liberal and conservative blogospheres is clear. Indeed, 90% of the
interactions occur between blogs belonging to the same political community.

Some other prominent online social networks are Twitter, Facebook and
LinkedIn.


Table 1.1. Dimensions of three data sets of interacting high school students: the number
of students n, the number of classes K, and the number of snapshots T.


Year    n      K    T

2011    118    3    5609
2012    180    5    11273
2013    327    9    7375


Figure 1.4. Time-aggregated network obtained from the high-school interaction network
(year 2013). Legend (class): 2BIO1, 2BIO2, 2BIO3, MP, MP*1, MP*2, PC, PC*, PSI*.


Face-to-face interaction networks 


The high-school datasets represent close proximity encounters between students in a 
French high school. Student-to-student interactions are recorded every 20 seconds 
through wearable sensors, and the experiments span several school days. The same 
experiment was performed in three consecutive years (Fournet and Barrat, 2014; 
Mastrandrea et al., 2015), and the dimensions of each dataset are given in Table 1.1.
We also plot in Figure 1.4 the weighted graph for the year 2013, where the weights 
correspond to the number of interactions recorded between two students. Finally, 
as each student belongs to one class, the question of recovering the classes based 
on the temporal interactions arises. We will study this dataset in more detail in 
Chapter 6. 

We note that time aggregation can result in a loss of important information,
which could otherwise be inferred from the dataset’s temporal nature. For example,
Figure 1.5 shows, per snapshot, the average number of interactions per student
over a given day. The observed peaks correspond to the starting and ending times
of the breaks between courses, since students leave and join the classrooms at these
moments.


Figure 1.5. The average degree (the average number of interactions per student) over the
course of a single day, plotted against the hour of the day. The shaded regions correspond
to the breaks between classes.
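The two views of the high-school data discussed above (the time-aggregated weighted graph and the per-snapshot average degree) can be sketched in a few lines of Python. The snapshot format (a list of sets of undirected pairs) and the toy data are assumptions made for illustration; they are not the format of the actual dataset.

```python
from collections import Counter


def aggregate_snapshots(snapshots):
    """Weight of a pair = number of snapshots in which the pair interacts."""
    weights = Counter()
    for edges in snapshots:
        for u, v in edges:
            weights[frozenset((u, v))] += 1
    return weights


def average_degree_per_snapshot(snapshots, n):
    """Average degree 2|E_t|/n of each snapshot, for a network of n nodes."""
    return [2 * len(edges) / n for edges in snapshots]


# Toy data (hypothetical): 4 students observed over 3 snapshots.
snapshots = [
    {(0, 1), (2, 3)},
    {(0, 1)},
    {(0, 1), (1, 2), (2, 3)},
]
weights = aggregate_snapshots(snapshots)
degrees = average_degree_per_snapshot(snapshots, n=4)
```

The aggregated weights no longer record when the interactions took place, whereas the per-snapshot series keeps exactly the temporal information that aggregation loses.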


Communication networks 


Communication networks constitute an important class, which includes various 
transportation networks (roads, airplane maps, etc.) as well as phone and messaging 
communications between individuals. 

The Enron email dataset¹ contains approximately 500,000 emails from about
150 employees (mostly from the senior management team) of the Enron company 
(now bankrupt). Emails were recovered by the Federal Energy Regulatory Com- 
mission during the fraud investigation. This dataset was made public and has been 
used by many researchers for various information processing tasks, such as docu- 
ment classification or social network analysis (Carley and Skillicorn, 2005). 

The Copenhagen networks study dataset (Sapiezynski et al., 2019) records the
interactions of 700 university students over 4 weeks, including close-proximity
interactions, phone calls and Facebook friendships.


Information and collaboration networks 


Co-authorship networks are constructed by connecting two authors if they have
published a paper together. Since automated citation indexing is now common,
large datasets of co-authorship networks are now available. Examples include the
DBLP (Yang and Leskovec, 2015), Citeseer, Cora, WebKB (Getoor, 2005) and
PubMed (Namata et al., 2012) datasets. This procedure can be extended to other
domains. For example, using IMDB data one can produce a network of movie
actors, where two actors are connected if they starred in a movie together.

The Web-graph is another example of an information network. It is constructed
by linking webpage A to webpage B (usually with a directed link) if webpage A
cites webpage B. Several Web-graph and Wikipedia networks are available
from the NetSet database² and the Laboratory for Web Algorithmics (LAW).³


1. Available at https://www.cs.cmu.edu/~enron/


Figure 1.6. Dolphin network (Lusseau et al., 2003). Colours show how the network split
when a dolphin left the group.
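The co-authorship construction described in this section amounts to connecting every pair of authors who share a paper. A minimal sketch, assuming each paper is given as a list of author names (the names and the function `coauthorship_edges` are hypothetical):

```python
from itertools import combinations


def coauthorship_edges(papers):
    """Connect two authors if they appear together on at least one paper."""
    edges = set()
    for authors in papers:
        # Sorting gives each undirected pair a single canonical orientation.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


papers = [["Ann", "Bo", "Cy"], ["Bo", "Dee"], ["Ann"]]
edges = coauthorship_edges(papers)
```

The same function applies unchanged to the movie example, with films in place of papers and actors in place of authors.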


Biological networks 


The class of biological networks includes protein interaction networks, food webs 
and animal social networks. 

Let us present one example of an animal social network. The dolphin network
(Lusseau et al., 2003) is a social network of 62 dolphins, with edges representing
social interactions. During the study, a dolphin left the group, which resulted
in a split of the network into two communities (see Figure 1.6). The group later
reunited when this mysterious dolphin returned home.


Geometrically defined network topologies 


In machine learning tasks, data often come as a matrix

\[
X = (x_1, \dots, x_n)^T \in \mathbb{R}^{n \times m},
\]

where $n$ is the number of data points and $m$ is the dimension of each data point
(e.g., the number of features). To perform data analysis with the help of a network,
the topology and weights of the graph must be built from the data. A common way
to define the weight of an edge connecting vertices $i$ and $j$ is by using a Gaussian
kernel with thresholding

\[
w_{ij} =
\begin{cases}
\exp\left( - \dfrac{\|x_i - x_j\|^2}{\tau^2} \right), & \text{if } \|x_i - x_j\|^2 < \kappa, \\
0, & \text{otherwise},
\end{cases}
\]

where $\tau$ and $\kappa$ are tunable parameters and $\|x_i - x_j\|$ is the distance between the
data points $x_i$ and $x_j$. In particular, the cutoff parameter $\kappa$ prevents the network from
being too dense, with many small-weight edges. Another common method is to connect each
vertex to its $K$ nearest neighbours. We refer to (Grady and Polimeni, 2010, Chapter 4) and
(Stankovic et al., 2020) for the description of other methods for data similarity
network construction.


2. https://netset.telecom-paris.fr/index.html

3. https://law.di.unimi.it/datasets.php


The MNIST database (LeCun et al., 1998) is a database of 70,000 handwritten
digits commonly used as a benchmark in machine learning. Figure 1.7 presents
a network built from 300 pictures of digits 0, 1 and 2 using the Gaussian kernel as
a weight function. More precisely, we first compute a $K$-nearest neighbour graph
($K = 8$) with weights

\[
W_{ij} =
\begin{cases}
\exp\left( - \dfrac{\|x_i - x_j\|^2}{\tau_i^2} \right), & \text{if $x_j$ is among the $K$ nearest neighbours of $x_i$}, \\
0, & \text{otherwise},
\end{cases}
\]

where $\tau_i$ represents the distance between $x_i$ and its $K$th-nearest neighbour. The
weight matrix is finally symmetrised by replacing $W$ with $\frac{1}{2}(W + W^T)$.


Figure 1.7. Network constructed from 300 pictures of digits 0, 1 and 2 taken from the
MNIST database.
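The K-nearest-neighbour construction with Gaussian weights and final symmetrisation can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code used to produce Figure 1.7: data points are stored as rows of `X`, the kernel bandwidth for node i is taken to be its squared distance to the K-th nearest neighbour, and the function name `knn_gaussian_graph` and the random test data are hypothetical.

```python
import numpy as np


def knn_gaussian_graph(X, K):
    """Symmetrised K-nearest-neighbour graph with Gaussian weights.

    Assumes one data point per row of X and no duplicated points
    (duplicates could make the bandwidth tau_i equal to zero).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq_dist[i])
        nearest = [j for j in order if j != i][:K]
        tau_sq = sq_dist[i, nearest[-1]]  # squared distance to the K-th neighbour
        W[i, nearest] = np.exp(-sq_dist[i, nearest] / tau_sq)
    # Symmetrise: replace W by (W + W^T) / 2.
    return 0.5 * (W + W.T)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))  # hypothetical data: 30 points in dimension 5
W = knn_gaussian_graph(X, K=8)
```

The dense pairwise-distance matrix keeps the sketch short but costs memory quadratic in the number of points; for large data sets a dedicated nearest-neighbour search would be preferable.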
1.2 Unifying Properties of Complex Networks


1.2.1 What are the Properties Commonly Shared by Networks? 


Many real-world complex networks share a number of basic properties. 


Sparsity 


The degree of node $i$, denoted $d_i$, is the number of edges incident to this node, or
in other words, the number of nodes that are interacting with node $i$. Even if the
number of nodes $n$ in a network can be large, the average degree $d = \frac{1}{n} \sum_{i=1}^{n} d_i$ is
often small. For example, in Table 1.2, we see that the DBLP co-authorship network
has 13,326 nodes, while the average degree $d$ is just 5.1. This effect is even more
evident in social networks such as Facebook. Even if the total number of users is
huge and still growing, the number of friends of each user remains small (maybe
even bounded). We say that a network is sparse if the average degree $d$ is several
orders of magnitude smaller than the number of nodes $n$.
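As a small illustration of these definitions, the sketch below computes the degrees and the average degree from an edge list, and checks the DBLP figure quoted above using the identity that the degrees sum to twice the number of edges. The toy edge list and the function name `degrees` are assumptions; the DBLP node and edge counts are those of Table 1.2.

```python
def degrees(n, edges):
    """Degree of each node 0..n-1 in an undirected graph given as an edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Toy graph (hypothetical) on 5 nodes.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
deg = degrees(5, edges)
avg_degree = sum(deg) / len(deg)  # equals 2|E|/n

# DBLP row of Table 1.2: n = 13326 nodes and |E| = 34281 edges.
dblp_avg_degree = 2 * 34281 / 13326
```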


Connectivity 


A connected component of a binary undirected graph $G = (V, E)$ is a maximal set $U$ of
nodes such that between any two nodes $i, j \in U$ there exists a path linking $i$ to $j$.
Since two distinct connected components are necessarily disjoint, the node set $V$ can
be partitioned into a finite number of non-overlapping connected components
$U_1, \dots, U_p$. We say that the graph is connected if $p = 1$, and disconnected
otherwise. Even though real-world networks might be disconnected, typically the
relative size of the largest connected component is very large (for example,
containing about 90% of the nodes), while the other components are much smaller
(Newman, 2001a).
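The partition of the node set into connected components can be computed with a breadth-first search. The following is a standard sketch; the function name and the toy graph are illustrative choices.

```python
from collections import deque


def connected_components(n, edges):
    """Partition nodes 0..n-1 of an undirected graph into connected components."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen[start] = True
        component = [start]
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    component.append(w)
                    queue.append(w)
        components.append(sorted(component))
    return components


components = connected_components(6, [(0, 1), (1, 2), (3, 4)])
largest_fraction = max(len(c) for c in components) / 6
```

In this toy graph there are three components, and the largest one contains half of the nodes.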


Small world 


In a famous experiment, Milgram asked participants to mail a folder (containing
several documents related to the study) to one of their acquaintances in an
attempt to eventually reach an assigned target individual (Milgram, 1967). While
in most cases the individuals failed (either by incapacity or lack of willingness),
about 20% of the participants managed to send the documents to the assigned
target.⁴ Moreover, the mean number of intermediaries between starters and the
target was 5.2. While Milgram’s experiments were later criticized (Kleinfeld, 2002),
they entered popular culture as the six degrees of separation phenomenon.
In fact, this phenomenon has since been empirically observed in many networks
(see Watts, 2000; Newman, 2001b and Table 1.2).


4. This is astonishing. Milgram’s experiment was repeated using e-mail. Dodds et al. (2003) asked 24,163
volunteers to start e-mail chains, aiming to reach 18 target persons in 13 countries. Only 384 (less than 1.6%)
of those chains were completed!


Table 1.2. Basic characteristics of a selection of networks. The quantities are: the number
of nodes n, the number of edges |E|, the average degree (the average number of neighbour
nodes) d, the average distance between two nodes δ, the clustering coefficient cc (in
parentheses, the clustering coefficient if the edges of the graph were drawn randomly),
and the exponent of the degree distribution α.


Network           n       |E|      d      δ     cc              α

Political blogs   1222    16717    27.4   2.7   0.32 (0.07)     1.5
citeseer          2110    3720     3.5    9.3   0.17 (0.005)    2.7
cora              2485    5069     4.0    6.3   0.24 (0.005)    2.9
LiveJournal       2766    24138    17.5   3.9   0.41 (0.02)     2.1
wikischools       4403    100382   46     2.5   0.28 (0.03)     2.3
DBLP              13326   34281    5.1    6.9   0.61 (0.001)    2.9
wikivitals        10008   629521   126    2.4   0.26 (0.04)     2.7
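The average distance between two nodes, as reported in Table 1.2, can be computed by running a breadth-first search from every node. The sketch below assumes a connected graph stored as a dictionary of neighbour sets; the ring example and the added shortcut are hypothetical and only illustrate how a single long-range edge shortens typical paths.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in number of edges) from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj):
    """Average distance over all ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs


n = 20
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
avg_ring = average_distance(ring)

# Add one shortcut between two opposite nodes of the ring.
ring[0].add(10)
ring[10].add(0)
avg_shortcut = average_distance(ring)
```

Running one search per node costs time proportional to n(n + |E|), which is fine for the networks of Table 1.2 but would call for sampling of source nodes on much larger graphs.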


Edge transitivity 


A popular saying tells us that “a friend of my friend is my friend”. Thus, one would
expect the interactions in a network to be transitive. This means that if Alice
interacts with Bob, and Bob interacts with Cecile, then Alice and Cecile are also
likely to interact. The clustering coefficient measures this phenomenon. We define
a connected triple as a set of three nodes, where one node is
connected to two other nodes. We also define a triangle as a set of three nodes that
are connected to each other. Since each triangle of three nodes contributes three
connected triples (one centred on each of the three nodes), the clustering
coefficient cc is given by


\[
\mathrm{cc} = \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}.
\]


Consider a graph in which the interactions among nodes are purely random (i.e.,
an interaction between two nodes occurs with a probability $p$). Since there are $\binom{n}{3}$
node sets of size three, the expected number of triangles is $\binom{n}{3} p^3$. Each such node
set can host three connected triples (one centred on each of its nodes), each present with
probability $p^2$, so the expected number of connected triples is $3 \binom{n}{3} p^2$. Hence, the
clustering coefficient of a random graph is approximately $p$. Finally, since there are
$\binom{n}{2}$ node pairs, each of which interacts with probability $p$, $p$ can be estimated by
the fraction $|E| / \binom{n}{2}$. Hence, the clustering coefficient of a random graph can be
estimated by $d / (n - 1) \approx d / n$. We observe in Table 1.2 that the clustering coefficients
of real-world social networks are much higher (often by orders of magnitude) than those of
random graphs of the same size.
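The definition of the clustering coefficient (three times the number of triangles divided by the number of connected triples) translates directly into code. A minimal sketch, assuming the graph is stored as a dictionary mapping each node to the set of its neighbours; the function name and the toy graph are illustrative.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """cc = 3 * (number of triangles) / (number of connected triples)."""
    # A node of degree k is the centre of k * (k - 1) / 2 connected triples.
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # Count the triples whose two end nodes are also linked; every triangle
    # is found three times, once from each of its nodes.
    closed = sum(1 for u in adj for v, w in combinations(adj[u], 2) if w in adj[v])
    triangles = closed // 3
    return 3 * triangles / triples if triples else 0.0


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = clustering_coefficient(adj)
```

In this example there is one triangle and five connected triples (one centred on node 0, one on node 1 and three on node 2), so the coefficient is 3/5.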


Heavy-tailed degree distribution 


Let us denote by $p_k$ the probability for a uniformly sampled node to have degree $k$
and call $\{p_k : k = 0, 1, 2, \dots\}$ the degree distribution. In a random network, where
$|E|$ edges are drawn uniformly at random among the $\binom{n}{2}$ node pairs, the degree
distribution is binomial with parameters $n$ and $\hat{p}$, with $\hat{p} = |E| / \binom{n}{2}$ being an
estimate for the edge probability. Nonetheless, in most networks the degree distribution is
highly right-skewed; in other words, it has a heavy tail, with values that are
far above the mean. This highlights the fact that there is a small number of nodes
having very large degrees (for example influencers in a social network), whereas the
majority of nodes have very small degrees. Therefore, it is in general more accurate
to model the degree distribution of real networks by a power law.

A random variable $X \in [x_{\min}, +\infty)$ is distributed according to a continuous
power law of exponent $\alpha$ if it is drawn from a probability distribution whose
density is $f(x) = C x^{-\alpha}$. While $\alpha > 1$ is required for the probability distribution to
be well-defined (and then $C = (\alpha - 1) x_{\min}^{\alpha - 1}$ from normalisation), typical values
of $\alpha$ often lie in the range $2 < \alpha < 3$. An important property of power laws is
that they are scale-free (or scale-invariant), namely $f(cx) \propto f(x)$ for any constant $c > 0$.
As the degrees are integer values, we will consider the discrete variant of a power
law, namely the Zipfian distribution, where $P(X = k) = C k^{-\alpha} 1(k \ge x_{\min})$ with
$C = \left( \sum_{k \ge x_{\min}} k^{-\alpha} \right)^{-1}$.
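As a quick check of the normalisation constant and of the scale-free property, here is a short derivation that uses only the continuous density $f(x) = C x^{-\alpha}$ on $[x_{\min}, +\infty)$ with $\alpha > 1$:

```latex
\begin{align*}
1 &= \int_{x_{\min}}^{\infty} C x^{-\alpha} \, dx
   = C \, \frac{x_{\min}^{1 - \alpha}}{\alpha - 1}
   \quad \Longrightarrow \quad
   C = (\alpha - 1) \, x_{\min}^{\alpha - 1}, \\
f(cx) &= C (cx)^{-\alpha} = c^{-\alpha} f(x)
   \quad \text{for any constant } c > 0 .
\end{align*}
```

The integral converges only when $\alpha > 1$, which is why this condition is needed for the distribution to be well-defined.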

While fitting power laws is complex as large fluctuations occur in the tail of the
distribution (Newman, 2005b; Clauset et al., 2009), it is convenient to notice that
$\log P(X = k) = -\alpha \log k + \log C$ for $k \ge x_{\min}$, and thus on a log-log scale the
probability distribution is a straight line. To reduce the effect of the aforementioned
tail fluctuations, it is better to use the Complementary Cumulative Distribution
Function (CCDF) for fitting instead of the density function. The Hill estimator
also accurately estimates the exponent of a power law; see, e.g., (Clauset et al., 2009)
for details. Figure 1.8 shows the degree distribution of the Citeseer network.
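To illustrate the estimation of a power-law exponent, the sketch below draws samples from a continuous power law by inverse-transform sampling and recovers the exponent with the maximum-likelihood (Hill-type) formula for the continuous case, 1 + n / Σ log(x_i / x_min). The parameter values, the sample size and the function names are assumptions chosen for illustration; real degree data are discrete, and the continuous formula is then only an approximation.

```python
import math
import random


def sample_power_law(alpha, x_min, size, rng):
    """Inverse-transform sampling from the density f(x) = C x^(-alpha), x >= x_min."""
    # 1 - rng.random() lies in (0, 1], so the power below is always finite.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(size)]


def hill_estimator(samples, x_min):
    """Maximum-likelihood estimate of alpha from the samples above x_min."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def empirical_ccdf(samples):
    """Sorted values x together with the fraction of samples that are >= x."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(n - i) / n for i in range(n)]


rng = random.Random(0)
samples = sample_power_law(alpha=2.5, x_min=1.0, size=20000, rng=rng)
alpha_hat = hill_estimator(samples, x_min=1.0)
xs, ccdf = empirical_ccdf(samples)
```

Plotting `ccdf` against `xs` on a log-log scale should give an approximately straight line of slope -(alpha - 1), which is smoother in the tail than the corresponding histogram of the density.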

While the power-law paradigm has been widely accepted and is sometimes
referred to as a ‘universal law’, it has also been heavily criticized. In particular, a
linear regression on the log-log plot generates significant systematic errors under
relatively common conditions (see Clauset et al., 2009, Appendix A). Moreover,
Lima-Mendez and van Helden (2009) showed that for biological networks the power-law
degree distribution is a myth. Similarly, by applying goodness-of-fit tests on more
than 1000 networks, Broido and Clauset (2019) showed that networks with power-law
degree distributions are actually rare. Nonetheless, a vast majority of real-world
networks have heavy-tailed degree distributions.


Figure 1.8. The degree distribution of the Citeseer network.


1.2.2 How do these Properties Arise? 


In order to explain how the described properties arise in networks, we introduce 
some random graph models with stochastic node interactions. The random graph 
models will be studied in detail in Chapter 2. These models will also serve as refer- 
ence for studying statistical problems related to networks. 


Erdős-Rényi random graphs


The simplest random graph model is the Erdős-Rényi model. This model has n nodes
and each pair of nodes is connected with probability p.

This is a simple model, in particular since it assumes that interactions between
different node pairs are independent. Hence, the model does not produce any
edge transitivity. Moreover, the degree distribution of an Erdős-Rényi random graph
is binomial, Bin(n, p),⁵ which is not heavy tailed.

Nevertheless, the Erdős-Rényi model allows us to illustrate the properties of
connectivity and sparsity in a beautiful manner. Indeed, since the degree distribution
is binomial, it follows that the average degree $d$ of the nodes is equal to $np$. If $p$ is
constant, then $d$ scales with the number of nodes $n$, and hence the
graph is not sparse in this scaling regime. It is thus common to scale $p$ with $n$, such
that $p = p_n \ll 1$. For example, by choosing $p_n = \frac{a}{n}$ with $a$ constant, we have $d = a$,
and the average degree remains constant as $n$ grows. We will see in Chapter 2 that
another interesting choice is $p_n = \frac{a \log n}{n}$, so that the average degree $d = a \log n$
grows logarithmically with $n$. In Figure 1.9, two examples of Erdős-Rényi graphs
are shown. We observe that when $p_n = \frac{2}{n}$ in (a), the graph is disconnected, i.e.,
a significant number of nodes are grouped into one connected component, while
some nodes remain isolated. On the contrary, when $p_n = \frac{2 \log n}{n}$ in (b), the graph
appears to be connected. We will see in Chapter 2 how rigorous statements
confirm these observations.


5. This is because a given node $i$ has $n$ potential neighbours ($n - 1$, if we exclude self-loops), and this node $i$
is connected to another node with probability $p$.


Figure 1.9. Erdős-Rényi graphs with $n = 100$ and various interaction probabilities $p_n$:
(a) $p_n = \frac{2}{n}$; (b) $p_n = \frac{2 \log n}{n}$.


Figure 1.10. Example of RGG, when $S = [0, 1]^2$ and $r = 0.1$, for different $n$:
(a) $n = 200$; (b) $n = 400$.
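The two scaling regimes compared in Figure 1.9 can be explored with a small simulation. The sketch below samples Erdős-Rényi graphs by flipping one independent coin per node pair and reports the average degree and the number of isolated nodes. The graph size n = 500, the seed and the function names are assumptions made for this illustration and differ from the setting of the figure.

```python
import math
import random


def erdos_renyi(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 node pairs is an edge with probability p."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def summary(adj):
    """Average degree and number of isolated (degree-zero) nodes."""
    n = len(adj)
    avg_degree = sum(len(nb) for nb in adj) / n
    isolated = sum(1 for nb in adj if not nb)
    return avg_degree, isolated


rng = random.Random(1)
n = 500
avg_sparse, isolated_sparse = summary(erdos_renyi(n, 2 / n, rng))
avg_log, isolated_log = summary(erdos_renyi(n, 2 * math.log(n) / n, rng))
```

With probability 2/n the average degree stays close to 2 and a noticeable fraction of nodes is isolated, while with probability 2 log(n)/n the average degree is close to 2 log n and isolated nodes are very unlikely.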


Random geometric graphs 


Edge transitivity can be modelled by introducing geometry. Let us consider n nodes,
and assume that each node has a random position on the Euclidean plane. Intuitively,
nodes that are close to each other have more chance of being connected than
nodes placed further apart. An extreme choice is to assume that two nodes are
connected if and only if their Euclidean distance is less than a threshold r. This gives the
Random Geometric Graph (RGG) model. We observe in Figure 1.10 that this model leads
to graphs with a large number of triangles (compared with Erdős-Rényi graphs).
Moreover, the graphs appear locally dense while remaining globally fairly sparse.
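The abundance of triangles in random geometric graphs can be checked numerically. The sketch below places nodes uniformly at random in the unit square, connects pairs at distance less than r, and compares the observed number of triangles with the expected number in a purely random graph of the same edge density. The values n = 400 and r = 0.1 mirror Figure 1.10; the seed, the uniform placement and the function names are assumptions.

```python
import math
import random


def random_geometric_graph(n, r, rng):
    """Place n nodes uniformly in the unit square; connect pairs at distance < r."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < r:
                adj[i].add(j)
                adj[j].add(i)
    return points, adj


def count_triangles(adj):
    """Count each triangle i < j < k exactly once."""
    return sum(
        1
        for i in range(len(adj))
        for j in adj[i] if j > i
        for k in adj[j] if k > j and k in adj[i]
    )


rng = random.Random(0)
n, r = 400, 0.1
points, adj = random_geometric_graph(n, r, rng)
num_edges = sum(len(nb) for nb in adj) // 2
triangles = count_triangles(adj)

# Expected number of triangles if the same number of edges were placed at random.
p_hat = num_edges / math.comb(n, 2)
expected_random_triangles = math.comb(n, 3) * p_hat ** 3
```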


Preferential attachment models 


While the Erdős-Rényi model explains sparsity and connectivity, and geometric
graphs explain transitivity, neither of these models exhibits a power-law degree
distribution. To model networks with scale-free degree distributions, Solla Price (1965,
1976) (analysing citation networks) and Barabási and Albert (1999) (analysing
web-graphs) proposed the preferential attachment model. It is a growing network model
in which a new node enters the network at each time step. The probability that the
new node interacts with an existing node $i$ is proportional to the degree $d_i$ of node
$i$. Therefore, nodes with large degree tend to attract new edges, hence increasing
their degree even more. We plot an example of a graph generated by the preferential
attachment model in Figure 1.11, as well as its degree distribution. We will give a
rigorous definition of this model in Chapter 2 and prove that this model indeed
has a power-law degree distribution in the limit.


Figure 1.11. Left: A realisation of the preferential attachment model with $n = 500$ nodes.
Right: Degree distribution in log-log scale of a preferential attachment graph with $n = 10^4$
nodes. The orange curve represents the linear regression fitting.
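The growth mechanism can be simulated in a few lines. The sketch below is a simplified variant chosen for illustration, not the rigorous definition given in Chapter 2: the graph starts from a single edge, and each new node attaches by exactly one edge to an existing node chosen with probability proportional to its degree. The function name, the seed and the size n = 5000 are assumptions.

```python
import random
from collections import Counter


def preferential_attachment(n, rng):
    """Grow a tree on nodes 0..n-1 by degree-proportional attachment."""
    edges = [(0, 1)]
    # Node i appears d_i times in `endpoints`, so a uniform draw from this
    # list selects an existing node with probability proportional to its degree.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


rng = random.Random(0)
n = 5000
edges = preferential_attachment(n, rng)
degree = Counter(node for edge in edges for node in edge)
degree_counts = Counter(degree.values())  # number of nodes having each degree
```

Keeping the list of edge endpoints makes each degree-proportional draw a constant-time operation, at the cost of memory proportional to the number of edges. Most nodes end up with degree one, while a few early nodes accumulate large degrees.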


1.3 What Are the Statistical Problems Related to Networks?


1.3.1 How to Cluster Network Nodes? 


Community detection (also referred to as community recovery or graph clustering) is a 
very common problem in network analysis. It consists of grouping the nodes into 
K communities (also called groups, blocks or clusters), such that nodes inside a 
community have some similar properties. Intuitively, we shall assume that nodes in 
the same community are more likely to interact than nodes belonging to different 
communities.⁶


6. This is sometimes referred to as associative communities. Nonetheless, some networks may be disassociative.
That is, interactions are more likely to occur between nodes in different communities.


Again intuitively, a good partition should minimise the number of interactions
between different clusters. Consequently, a first class of graph clustering methods,
called cut-based methods, aims to find K clusters such that the number of interactions
between different clusters is minimised. This leads to several spectral methods that use
the information contained in the eigenvectors of a well-chosen matrix to retrieve
the communities. This beautifully links graph theory with linear algebra.
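As an illustration of the spectral idea, the sketch below splits a graph into two clusters using the sign of the eigenvector associated with the second-smallest eigenvalue of the unnormalised Laplacian L = D - A (the Fiedler vector). The choice of this particular matrix is an assumption made for the example, as are the toy graph and the function name `spectral_bisection`.

```python
import numpy as np


def spectral_bisection(A):
    """Split the nodes in two according to the sign of the Fiedler vector of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigenvectors[:, 1]                   # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


# Toy graph (hypothetical): two cliques of 5 nodes joined by a single edge.
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1
labels = spectral_bisection(A)
```

On this example the two cliques receive different labels, and the only edge that is cut is the one joining them. Which clique gets label 0 depends on the arbitrary sign of the eigenvector.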

Other clustering methods assess the quality of a given partition via certain criteria 
which they aim to optimise. One example of this class of methods is based on the 
concept of modularity. In essence, modularity compares a graph with clusters to 
some reference random graph model. The maximisation of modularity is usually 
done via a greedy algorithm. One strength of such methods is that it is not necessary 
to know the number of clusters in advance. 
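To make the notion concrete, the sketch below evaluates one common definition of modularity, the Newman-Girvan form, in which the reference model preserves node degrees: Q is the sum over communities c of e_c/m - (vol_c/(2m))^2, where m is the number of edges, e_c the number of edges inside c, and vol_c the total degree of c. The choice of this definition, the function name and the toy graph are assumptions; the text above does not fix a particular reference model.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = {}
    internal = {}  # number of edges inside each community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            internal[community[u]] = internal.get(community[u], 0) + 1
    volume = {}    # total degree of each community
    for node, d in degree.items():
        volume[community[node]] = volume.get(community[node], 0) + d
    return sum(internal.get(c, 0) / m - (vol / (2 * m)) ** 2 for c, vol in volume.items())


# Toy graph (hypothetical): two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {node: 0 for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles gives Q = 5/14, whereas putting all nodes in a single community gives Q = 0, so the natural partition scores higher.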

Unfortunately, we will see that the modularity-based methods are prone to
overfitting. In particular, we will show that on random graphs with no community
structure, such as Erdős-Rényi random graphs, it is possible to find partitions with a high
modularity! We will see how we can mitigate this problem by using Bayesian methods.
Those methods assume that the graph data is generated from a random graph
model with a clustering structure and look for the best parameters via a Markov
Chain Monte Carlo algorithm.


1.3.2 Which Nodes are Most Important in a Network? 


In large networks, many applications require the ranking of nodes in terms of
importance. Examples include the identification of the most influential nodes in
social networks, the study of super-spreaders of a disease, and the analysis of
bottlenecks in urban or technological networks (such as electric grids). While all these
problems are related to finding the most important nodes, the notion of
importance varies greatly. Indeed, the most influential nodes in a social network
may simply be the nodes with the largest degree. For example, when a new user creates an
account on Twitter or Instagram, the platform suggests that they follow popular users.
On the contrary, bottlenecks in an electric grid can be located
at nodes with small degree whose removal would
greatly change the network flow. Finally, other methods, such as
PageRank, rank the nodes based on a random walk on the network nodes,
mimicking browsing or searching behaviour.
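A random-walk ranking of the PageRank type can be sketched with a simple power iteration. The version below is a common textbook formulation and not necessarily the one analysed later in the book: the damping factor 0.85, the uniform treatment of nodes without outgoing links, the fixed number of iterations and the toy link structure are all assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a directed graph given as adjacency lists."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # With probability 1 - damping, the walker jumps to a uniformly chosen node.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            # A node without outgoing links redistributes its rank uniformly.
            targets = out_links[u] or nodes
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Toy web-graph (hypothetical): three pages link to "hub".
links = {"hub": [], "a": ["hub"], "b": ["hub"], "c": ["hub", "a"]}
rank = pagerank(links)
top_node = max(rank, key=rank.get)
```

The scores form a probability distribution over the nodes, and the page that receives links from all the others obtains the highest score.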


1.3.3 How to Infer Important Information in a Network? 


Analysing a very large network is often more easily done via summary statistics. Some
examples are: estimating the average age of users in a social network, finding the
proportion of drug users in a population, polling before an election, etc. A first
possibility is to uniformly sample k nodes, and average over this sample. Unfortunately,
in practice, it is often hard to sample the nodes uniformly. Typically, uniform
sampling in a huge social network like Facebook or Twitter cannot be done efficiently as
(a) the list of all accounts on these platforms is not publicly available; and (b) there
is a strict limitation on the API access rate. For instance, a standard Twitter account
can make no more than one request per minute. At that rate, we would need about
950 years to crawl the entire Twitter social network.

Moreover, a small bias in the sampling process may lead to a very large bias in the
estimator, as many famous examples involving polling before elections can attest. It
is important to note that a bias in the node sampling cannot be mitigated by simply
sampling more nodes. One infamous example involves The Literary Digest, which in
1936 polled more than two million individuals and wrongly predicted a clear
victory of Landon over Roosevelt. The way of sampling created a bias, since the
magazine simply polled its own readers, who were wealthier than the average
citizen.⁷
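The claim that sampling more nodes does not remove a sampling bias can be illustrated with a toy simulation. All numbers below are hypothetical: half of the population answers 1 and half answers 0, and respondents answering 1 are assumed to be twice as likely to be reached. Sampling is done with replacement for simplicity.

```python
import random


def biased_poll(population, weights, sample_size, rng):
    """Average answer in a sample drawn with the given (non-uniform) weights."""
    sample = rng.choices(population, weights=weights, k=sample_size)
    return sum(sample) / sample_size


rng = random.Random(0)
population = [1] * 50_000 + [0] * 50_000   # true average is 0.5
weights = [2.0] * 50_000 + [1.0] * 50_000  # assumed bias towards the answer 1
true_mean = sum(population) / len(population)
estimates = {k: biased_poll(population, weights, k, rng) for k in (100, 10_000, 200_000)}
```

As the sample size grows, the estimate stabilises near 2/3 rather than near the true value 1/2: a larger sample reduces the variance of the estimator but leaves its bias untouched.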


Book Organisation 


The book is organised as follows. We start by presenting various random graph
models in Chapter 2. Chapter 3 focuses on centrality indices in networks. The
community detection problem is presented and analysed in Chapter 4, and Chapter 5
is devoted to semi-supervised learning on networks, when some information about
the community structure is given. In Chapter 6, we extend the community
detection problem to temporal networks. Finally, in Chapter 7 we present techniques
for sampling and performing questionnaires in networks.


Book Bibliographic Position 


Let us discuss the position of the book with respect to the other reference works.
Random graph models are thoroughly analysed in Bollobás, 2001; Chung and Lu,
2006; Janson et al., 2011; Hofstad, 2016. Graph formation processes (e.g.,
preferential attachment processes) and dynamics on graphs (e.g., epidemic processes)
are studied in Durrett, 2007; Draief and Massoulié, 2010; Barabási, 2016;
Newman, 2018; Masuda and Lambiotte, 2021. Specific applications of random graph
and complex network models to social networks are discussed in Wasserman and
Faust, 1994; Doreian et al., 2005; Carrington et al., 2005; Scott and Carrington,
2011; Prell, 2012; Yang et al., 2016; Borgatti et al., 2018; Knoke and Yang, 2019.
The fitting and visualization of random graphs and complex networks are presented
in Ellson et al., 2004; Hagberg et al., 2008; Bastian et al., 2009; Kolaczyk et al.,
2009; Goldenberg et al., 2010; Cherven, 2015; Mrvar and Batagelj, 2016; De Nooy
et al., 2018; Kolaczyk and Csárdi, 2020. We do not cover the above topics in detail.


7. See https://en.wikipedia.org/wiki/The_Literary_Digest.

Our emphasis is on fundamental statistical aspects of complex network analysis
(also known as network science). Graph clustering and community detection, in
particular clustering of stochastic block models, are studied in Newman, 2018; Abbe,
2018. This is still a very rapidly developing research area, with many interesting 
new results continuing to appear. Here, we summarise the main results in the com- 
munity detection problem, review important progress since 2018 and study cluster- 
ing in temporal networks. Semi-supervised learning is presented in Chapelle et al., 
2006. In this book, we focus on graph-based semi-supervised learning methods and 
their application to temporal networks. 

To the best of our knowledge, there are no textbooks about the detailed analysis 
of network centrality indices (especially about their comparative analysis and their 
various applications beyond the scope of social networks). As is the case for the 
community detection problem, new important results continue to emerge. We have
tried to provide a state-of-the-art survey of this area. Also, we have not seen any textbook
about modern methods for sampling in networks.

Thus, we hope that this is the first comprehensive textbook-style exposition of 
the statistical analysis of networks. 
“Distributed Learning and Control for Network Analysis” and EU COST Action


Random Graph Models


This chapter is devoted to basic models for complex networks. We introduce sev- 
eral important classes of random graph models and we illustrate and study some 
statistical properties of these models, such as degree distribution and connectivity. 


Notations. In the following, $G = (V, E)$ denotes a graph, where $V = \{1, \dots, n\}$
is the set of vertices (nodes) and $E$ is the set of edges (links). We say that a graph $G$
is a random graph if $G$ was generated from a random graph model. A random graph
model refers to a probability distribution over the set of all graphs.

We will denote by $d_i$ the degree of node $i$. The vector $d = (d_1, \dots, d_n)$ is
called the degree sequence of the nodes. Given a random graph model, the degree $d_i$
of a node $i$ is a random variable and is distributed according to some probability
distribution. When all the degrees are identically distributed (i.e., $d_1, \dots, d_n$ are
all distributed according to the same probability distribution $D$), we say that the
degrees in the graph $G$ are distributed according to the degree distribution $D$.
2.1 Erdős-Rényi Random Graphs


2.1.1 Definition


Definition 2.1. Let $n$ be an integer, and $P = (p_{ij})_{1 \le i < j \le n} \in [0, 1]^{n \times n}$ be a set of
probabilities. A Bernoulli random graph $G = (V, E)$ is an undirected, unweighted
graph such that:


• $V = \{1, \dots, n\}$;
• $\mathbb{P}((i, j) \in E) = p_{ij}$ for every node pair $(i, j)$ with $1 \le i < j \le n$.


We write $G \sim G(n, (p_{ij}))$. In a Bernoulli random graph, every node pair $(i, j)$ is
connected by an edge with probability $p_{ij}$, independently of all other node pairs.


Remark 2.1. If $G \sim G(n, (p_{ij})_{1 \le i < j \le n})$, then the adjacency matrix $A$ of $G$ is a
symmetric random matrix whose entries above the diagonal are independent, with
$A_{ij} = A_{ji} \sim \mathrm{Ber}(p_{ij})$ and $A_{ii} = 0$.


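Proposition 2.1. Let $G \sim G(n, (p_{ij}))$ and $A$ be its associated adjacency matrix. We
have:

$$\mathbb{P}(A) = \prod_{i<j} p_{ij}^{A_{ij}} (1 - p_{ij})^{1 - A_{ij}}.$$

Proof. The independence of the edge sampling process ensures that

$$\mathbb{P}(A) = \prod_{1 \le i < j \le n} \mathbb{P}(A_{ij}).$$

Moreover,

$$\mathbb{P}(A_{ij}) = \begin{cases} p_{ij} & \text{if } A_{ij} = 1, \\ 1 - p_{ij} & \text{if } A_{ij} = 0, \end{cases}$$

and this can conveniently be rewritten as $\mathbb{P}(A_{ij}) = p_{ij}^{A_{ij}} (1 - p_{ij})^{1 - A_{ij}}$.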


Example 2.1. Suppose that $p_{ij} = p$ for all $i, j$. Then, $G(n, (p_{ij}))$ is called the
Erdős-Rényi model¹, and traditionally denoted by $G(n, p)$ or $G_{n,p}$.


1. This model was first introduced by Gilbert (Gilbert, 1959), while in the same year a paper by Erdős
and Rényi studied a similar but different model (Erdős and Rényi, 1959), where all graphs on a fixed vertex
set with a fixed number of edges are equally likely. Asymptotically, these two models are equivalent.
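Corollary 2.1. Let $G \sim G_{n,p}$ and $A$ be its associated adjacency matrix. We have

$$\mathbb{P}(A) = (1 - p)^{\frac{n(n-1)}{2}} \left( \frac{p}{1 - p} \right)^{|E|},$$

where $|E|$ is the number of edges of $G$.


Proof. Using Proposition 2.1, we can write

$$\mathbb{P}(A) = \prod_{i<j} p^{A_{ij}} (1 - p)^{1 - A_{ij}} = \prod_{i<j} (1 - p) \left( \frac{p}{1 - p} \right)^{A_{ij}}.$$

The result follows by noticing that $|E| = \sum_{i<j} A_{ij}$ and that there are $n(n-1)/2$ node pairs.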


Algorithm 1 provides a simple way to generate an Erdős-Rényi random graph,
by looping over all possible node pairs $(i, j)$, and adding $(i, j)$ to the edge list with
probability $p$.


Algorithm 1: Simple generation of Erdős-Rényi graphs.
Input: number of nodes n, edge probability p ∈ [0, 1].
Output: list of edges E.

Process:
E ← ∅;
for i = 1 to n − 1 do
    for j = i + 1 to n do
        x ← random number between 0 and 1;
        if x < p then
            add the edge (i, j) to E.

Return: E.
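A minimal Python sketch of Algorithm 1, for illustration only. The function name `erdos_renyi_naive` and the optional `seed` argument are our own choices and are not part of the algorithm as stated; nodes are labelled 1, ..., n as in the text.

```python
import random


def erdos_renyi_naive(n, p, seed=None):
    """Sample the edge list of G(n, p) by testing every node pair (i, j), i < j."""
    rng = random.Random(seed)  # `seed` is a hypothetical convenience parameter
    edges = []
    for i in range(1, n):              # i = 1, ..., n - 1
        for j in range(i + 1, n + 1):  # j = i + 1, ..., n
            # rng.random() is uniform on [0, 1): the pair is kept with probability p
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

Every one of the n(n − 1)/2 pairs is visited, which is the quadratic cost discussed next.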


The space complexity of Algorithm 1 is $O(|E|)$ (it stores $|E|$
edges), while its time complexity is $O(n^2)$. In particular, it is very inefficient if
$p$ is small: in that case, the majority of node pairs $(i, j)$ will not be connected,
and we waste time testing them. More precisely, starting from
node $i$, the node pairs $(i, i+1), \dots, (i, i+k-1)$ will not be linked, while the pair
$(i, i+k)$ will give an edge. The number $k$ is the index of the first success in a
sequence of independent Bernoulli trials with parameter $p$ (that is, $k - 1$ failures followed by one success).
Hence, it is geometrically distributed with parameter $p$. Based on this observation,
Batagelj and Brandes, 2005 proposed Algorithm 2 for an efficient generation of a
sparse Erdős-Rényi graph. Its space complexity is $O(|E|)$ and its time complexity is $O(n + |E|)$.
Algorithm 2: Fast generation of sparse Erdős-Rényi graphs.
Input: number of nodes n, edge probability p ∈ [0, 1].
Output: list of edges E.

Process:
E ← ∅;
for i = 1 to n − 1 do
    v ← i;
    while v < n do
        k ← realisation of a geometric r.v. with parameter p;
        v ← v + k;
        if v ≤ n then
            add the edge (i, v) to E.

Return: E.
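The skipping idea of Algorithm 2 can be sketched in Python as follows. This is an illustration under our own assumptions: the helper `geometric`, the function name `erdos_renyi_fast` and the `seed` argument are hypothetical, and the geometric variable is drawn by inverse transform sampling with values in {1, 2, ...}.

```python
import math
import random


def geometric(p, rng):
    """Index (1, 2, ...) of the first success in i.i.d. Bernoulli(p) trials."""
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    # Inverse transform: P(K > k) = (1 - p)^k
    return 1 + int(math.log(u) / math.log1p(-p))


def erdos_renyi_fast(n, p, seed=None):
    """Sample the edge list of G(n, p), jumping directly from one edge to the next."""
    rng = random.Random(seed)
    edges = []
    if p <= 0.0:
        return edges  # no edge can appear
    for i in range(1, n):
        v = i
        while v < n:
            v += geometric(p, rng)  # skip the pairs that are not linked
            if v <= n:
                edges.append((i, v))
    return edges
```

For fixed `i`, the inner loop runs once per generated edge plus one final overshoot, so pairs that are not linked are never examined individually.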


2.1.2 Degree Distribution 


Proposition 2.2. Let $G \sim G(n, p)$, and let $d_i$ be the degree of node $i$. Then $d_i$ is
distributed according to $\mathrm{Bin}(n - 1, p)$. In particular, the average degree $\bar{d}$ of the graph
equals $(n - 1)p$, which is approximately $np$ for large $n$.


Proof. Indeed, the degree of node $i$ is $d_i = \sum_{j \ne i} A_{ij}$, where the $A_{ij}$, $j \ne i$, are $n - 1$ i.i.d.
Bernoulli random variables with parameter $p$.


Remark 2.2. It has been observed that many real graphs have a heavy-tailed degree
distribution (such as a power law), and not a binomial one (we refer to the discussion
in Section 1.2). An intuitive argument is the following one: since binomial
distributions are well concentrated, an Erdős-Rényi graph does not allow
for many hubs (nodes with degrees much higher than the average degree), which
we tend to see in real networks (e.g., in a social network, some people will have
many more connections than others and will act as influencers or hubs). Thus,
the basic Erdős-Rényi random graph is not an appropriate model for many real
networks.


2.1.3 Phase Transition Phenomena 


Heuristic 


This section considers sequences of Erdős-Rényi graphs $(G_1, G_2, \dots)$, such
that $G_n$ has $n$ nodes, and the link probability $p_n$ depends on $n$. In other words,
$G_n \sim G(n, p_n)$. We especially highlight two regimes:

• the regime $p_n = \frac{a}{n}$, where $a$ is constant;
• the regime $p_n = a \frac{\log n}{n}$, where $a$ is constant.


Figure 2.1. Erdős-Rényi graphs with $n = 100$ in the constant degree regime. Panels: (a) $d_n = 0.5$; (b) $d_n = 1$; (c) $d_n = 2$.

Figure 2.2. Erdős-Rényi graphs with $n = 100$ in the logarithmic degree regime. Panels: (a) $d_n = 0.5 \log n$; (b) $d_n = \log n$; (c) $d_n = 2 \log n$.


These two regimes are respectively called the constant degree regime and the logarithmic
degree regime, as the expected degree $d_n = n p_n$ equals $a$ in the first
case and $a \log n$ in the second case.² Figures 2.1 and 2.2 show examples of Erdős-Rényi
graphs in the constant and logarithmic degree regime respectively, for a given
$n = 100$. We make the following observations in the constant degree regime:


• when $d_n < 1$, most of the nodes are isolated, as expected since $d_n < 1$ means
that on average, a node has less than one neighbor;

• when $d_n > 1$, it seems that there is a connected component which contains
most of the nodes. We call this component the giant component.


On the other hand, in the logarithmic degree regime, we see that:


• if $d_n < \log n$, the graph appears to be disconnected, as there remain some
isolated nodes or isolated edges;
• on the contrary, when $d_n > \log n$, the graph appears to be fully connected.


These observations are further strengthened by Figure 2.3. In the constant degree
regime $p_n = \frac{a}{n}$, Figure 2.3(a) shows that when $a < 1$, the proportion of nodes in
the largest connected component is tiny. But, as soon as $a > 1$, this proportion
becomes non-negligible, and increases steadily with $a$. Similarly, in the logarithmic
degree regime $p_n = a \frac{\log n}{n}$, Figure 2.3(b) shows that the empirical probability that
the graph is connected goes from 0 to 1 as soon as $a$ becomes larger than 1.


Figure 2.3. Empirical evidence of a phase transition for the existence of a giant component
in the constant degree regime and a phase transition for connectivity in the logarithmic
degree regime of Erdős-Rényi graphs (here $n = 5000$).


2. More exactly, $d_n = (n - 1) p_n$, but if we allow self-loops, we can write $d_n = n p_n$, and moreover the difference
is negligible for large $n$.


Main statements 


Let us now present two main statements justifying our previous heuristic observations.


Theorem 2.1 (Phase transition for giant component, constant degree regime).
Let $G \sim G(n, p_n)$ be an Erdős-Rényi graph, with $p_n = \frac{a}{n}$ where $a$ is a constant. Almost
surely, the following holds:


(a) if $a < 1$, then there is no connected component of size larger than $O(\log n)$;

(b) if $a = 1$, then there is one large component of size $O(n^{2/3})$;

(c) if $a > 1$, then there is one and only one component of size $O(n)$. This component
is called the giant component.


The proof of Theorem 2.1 is complex and will not be presented in this book.
We refer the interested reader to (Hofstad, 2016).


Theorem 2.2 (Phase transition for connectivity). Let $G_n \sim G(n, p_n)$ be a sequence
of Erdős-Rényi random graphs. Let $d_n = n p_n$. The following holds.


(a) If there exists a sequence $(\omega_n)_n$ with $\omega_n \to +\infty$ such that $d_n \le \log n - \omega_n$,
then $G_n$ is a.s. not connected. More precisely, the graph $G_n$ contains a.s. at least
one isolated node;

(b) If there exists a sequence $(\omega_n)_n$ with $\omega_n \to +\infty$ such that $d_n \ge \log n + \omega_n$,
then $G_n$ is a.s. connected.
Example 2.2. Assume $d_n = \log n + \log \log n$, and let $G_n \sim G(n, p_n)$. Then,
Theorem 2.2 states that asymptotically $G_n$ will be a.s. connected (we can take
$\omega_n = \log \log n$).

Example 2.3. If $d_n = a \log n$, with $a \ne 1$ constant, then Theorem 2.2 applies with
$\omega_n = |a - 1| \log n$. Hence $G_n$ will be connected if $a > 1$, and will be disconnected
if $a < 1$. In particular, this justifies the heuristic observation from Figure 2.3(b).
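The threshold of Example 2.3 can be probed numerically. The sketch below is only an illustration under our own assumptions: the function names, the union-find connectivity test, the 0-based node labels and the number of trials are all hypothetical choices. It estimates how often a sampled graph with $d_n = a \log n$ is connected.

```python
import math
import random


def is_connected(n, edges):
    """Union-find test: True if the graph on nodes 0..n-1 has a single component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return components == 1


def connectivity_frequency(n, a, trials=50, seed=0):
    """Fraction of sampled G(n, p) graphs, with p = a log(n) / n, that are connected."""
    rng = random.Random(seed)
    p = min(1.0, a * math.log(n) / n)
    hits = 0
    for _ in range(trials):
        edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if rng.random() < p]
        hits += is_connected(n, edges)
    return hits / trials
```

For moderate `n`, one would expect the returned frequency to be close to 1 when `a` is well above 1 and close to 0 when `a` is well below 1, in line with Figure 2.3(b).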


Proof of the connectivity phase transition 


Before proving Theorem 2.2, let us start with the following lemma about the presence
of isolated nodes in an Erdős-Rényi graph.

Lemma 2.4. The probability that an Erdős-Rényi graph $G_n \sim G(n, p_n)$ contains at
least one isolated node satisfies

$$\lim_{n \to \infty} \mathbb{P}(\exists \text{ isolated node}) = \begin{cases} 0 & \text{if } p_n \ge \frac{\log n + \omega_n}{n} \text{ for some } \omega_n \to +\infty, \\ 1 & \text{if } p_n \le \frac{\log n - \omega_n}{n} \text{ for some } \omega_n \to +\infty. \end{cases}$$

This lemma implies that if $p_n \le \frac{\log n - \omega_n}{n}$, then the graph contains a.s. an isolated
node, and hence the graph is a.s. not connected. This precisely corresponds to
statement (a) of Theorem 2.2.


Proof of Lemma 2.4. Denote by $A_i$ the event that “node $i$ is isolated”, and let
$I_n = \sum_{i=1}^{n} 1(A_i)$ be the number of isolated nodes. Recall that $d_n = n p_n$ is the mean
degree. We have

$$\mathbb{P}(A_i) = (1 - p_n)^{n-1} \sim \exp(-d_n),$$

and thus

$$\mathbb{E}(I_n) = \sum_{i=1}^{n} \mathbb{P}(A_i) = n \mathbb{P}(A_1) \sim n e^{-d_n}.$$

(i) If $d_n = \log n + \omega_n$, we have $\mathbb{E}(I_n) \sim e^{-\omega_n} \to 0$. Since the expected number
of isolated nodes goes to 0, we can conclude the proof using the first moment
method. Indeed, recall that by Markov's inequality (see Proposition A.6 and Corollary
A.2), we have:

$$\mathbb{P}(\exists \text{ isolated node}) = \mathbb{P}(I_n \ge 1) \le \mathbb{E}(I_n) \to 0.$$


(ii) If $d_n = \log n - \omega_n$, we have $\mathbb{E}(I_n) \sim e^{\omega_n} \to +\infty$, and hence the expected
number of isolated nodes goes to infinity. Unfortunately, this is not enough to
conclude anything about the probability of existence of an isolated node, and we
will need the second moment method. Indeed, we have to show that the random
variable $I_n$ is well-concentrated around its mean. Since its mean diverges to infinity,
the result will follow. For that, we will use Chebyshev's inequality (Proposition A.7).
We have

$$\mathrm{Var}(I_n) = \mathbb{E}(I_n^2) - (\mathbb{E} I_n)^2.$$


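Note that

$$\mathbb{E}(I_n^2) = \mathbb{E}\Big( \sum_{i} \sum_{j} 1(A_i) 1(A_j) \Big) = \sum_{i} \sum_{j} \mathbb{P}(A_i \cap A_j) = n \mathbb{P}(A_1) + n(n-1) \mathbb{P}(A_1 \cap A_2).$$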


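Here we need to be careful, since $A_1$ and $A_2$ are not independent. Indeed, knowing
that node 1 is isolated means that there is no edge between node 1 and node 2, and
thus weakly increases the probability that node 2 is isolated. We have:

$$\mathbb{P}(A_1 \cap A_2) = \mathbb{P}(A_2 \mid A_1) \mathbb{P}(A_1) = (1 - p_n)^{n-2} \mathbb{P}(A_1) = \frac{1}{1 - p_n} \mathbb{P}(A_1)^2,$$

since $\mathbb{P}(A_1) = (1 - p_n)^{n-1}$. Lastly,

$$(\mathbb{E} I_n)^2 = \Big( \sum_{i} \mathbb{P}(A_i) \Big)^2 = \sum_{i} \sum_{j} \mathbb{P}(A_i) \mathbb{P}(A_j) = n^2 \mathbb{P}(A_1)^2.$$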


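Putting all pieces together leads to

$$\begin{aligned}
\mathrm{Var}(I_n) &= n \mathbb{P}(A_1) + n(n-1) \mathbb{P}(A_1 \cap A_2) - n^2 \mathbb{P}(A_1)^2 \\
&= n \mathbb{P}(A_1) + \frac{n(n-1)}{1 - p_n} \mathbb{P}(A_1)^2 - n^2 \mathbb{P}(A_1)^2 \\
&\le n \mathbb{P}(A_1) + \frac{n^2}{1 - p_n} \mathbb{P}(A_1)^2 - n^2 \mathbb{P}(A_1)^2 \\
&= n \mathbb{P}(A_1) + n^2 \mathbb{P}(A_1)^2 \Big( \frac{1}{1 - p_n} - 1 \Big) \\
&= \mathbb{E}(I_n) + (\mathbb{E} I_n)^2 \frac{p_n}{1 - p_n}.
\end{aligned}$$

Thus, by the second moment method (see Corollary A.5),

$$\mathbb{P}(I_n = 0) \le \frac{\mathrm{Var}(I_n)}{(\mathbb{E} I_n)^2} \le \frac{1}{\mathbb{E}(I_n)} + \frac{p_n}{1 - p_n}.$$

Since $\mathbb{E} I_n \to \infty$ and $p_n \to 0$, this last quantity goes to zero when $n$ goes to
infinity.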


We can now prove part (b) of Theorem 2.2.


Proof of Theorem 2.2(b). Suppose that $d_n \ge \log n + \omega_n$. Then, Lemma 2.4 shows
that, with probability tending to one, the number of isolated nodes $I_n$ is zero. To show that $G_n$ is indeed connected,
we need to show that

$$\mathbb{P}(G_n \text{ is disconnected and } I_n = 0) \to 0.$$

If $G_n$ is disconnected and has no isolated nodes, then $G_n$ contains a connected
component $C_k$ of size $2 \le k \le \lfloor n/2 \rfloor$. Directly counting the expected number of
components of size $k$ is difficult, as their probability depends on the exact
number of edges they contain (which can be as low as $k - 1$ if $C_k$ is a tree, up to
$\binom{k}{2}$ if $C_k$ is complete). To avoid this issue, we notice that the component $C_k$
contains spanning trees. By spanning tree of $C_k$, we mean a sub-graph of $C_k$ which is
a tree that contains all the vertices of $C_k$. Note that $C_k$ can contain more
than one spanning tree.

Let us denote by $X_k$ the number of trees on $k$ vertices that span a connected component of $G_n$. By the previous
observation, $X_k$ is at least the number of connected components of size $k$.
Moreover, if $G_n$ is disconnected and has no isolated nodes, then there must be
a $k \in \{2, \dots, \lfloor n/2 \rfloor\}$ such that $X_k \ge 1$. Hence by the union bound and the first
moment method,

$$\mathbb{P}(G_n \text{ is disconnected and } I_n = 0) \le \mathbb{P}\Big( \bigcup_{k=2}^{\lfloor n/2 \rfloor} \{X_k \ge 1\} \Big) \le \sum_{k=2}^{\lfloor n/2 \rfloor} \mathbb{P}(X_k \ge 1) \le \sum_{k=2}^{\lfloor n/2 \rfloor} \mathbb{E} X_k. \qquad (2.1)$$
We need to bound $\mathbb{E} X_k$. Firstly, there are $\binom{n}{k}$ ways of choosing $k$ vertices $(v_1, \dots, v_k)$
among the $n$ nodes. Once these $k$ vertices are chosen, then by Cayley's theorem [see
Theorem 3.17 of Hofstad, 2016], there are $k^{k-2}$ possible trees containing those
vertices. Since those $k$ vertices form a tree within $G_n$, they are linked by $k - 1$
edges, which has a probability $p_n^{k-1}$ of occurring. Lastly, the graph $G_n$ should not
include any edge between the tree and the rest of the graph: this has a probability
$(1 - p_n)^{k(n-k)}$. To summarize,

$$\mathbb{E} X_k = \binom{n}{k} k^{k-2} p_n^{k-1} (1 - p_n)^{k(n-k)}.$$


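Applying the Stirling bound $k! \ge k^k e^{-k}$, we have $\binom{n}{k} \le (ne/k)^k$. Moreover,
$(1 - p_n)^{k(n-k)} \le e^{-p_n k(n-k)} \le e^{-k n p_n / 2}$ (because $k \le n/2$) and $n p_n \ge 1$. Thus

$$\mathbb{E} X_k \le \frac{n e}{k^2} (n p_n e)^{k-1} e^{-k n p_n / 2} \le n \left( n p_n e^{1 - n p_n / 2} \right)^k.$$

Note that the function $f(x) = x e^{1 - x/2}$ is decreasing for $x > 2$. Since $n p_n \ge
\log n + \omega_n$, we have for $n$ large enough $n p_n > \log n$. Hence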


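$$\mathbb{E} X_k \le n \left( \log n \, e^{1 - \log n / 2} \right)^k = n \left( \frac{e \log n}{\sqrt{n}} \right)^k,$$

and for any $m \ge 1$,

$$\sum_{k=m}^{\lfloor n/2 \rfloor} \mathbb{E} X_k \le n \left( \frac{e \log n}{\sqrt{n}} \right)^m \frac{1}{1 - \frac{e \log n}{\sqrt{n}}} \le 2n \left( \frac{e \log n}{\sqrt{n}} \right)^m,$$

using $\frac{e \log n}{\sqrt{n}} \le \frac{1}{2}$ for $n$ large enough. The previous bound is rough, but enough
to show (taking $m = 3$) that $\sum_{k=3}^{\lfloor n/2 \rfloor} \mathbb{E} X_k$ converges to zero. Showing that $\mathbb{E} X_2$ goes to zero is immediate,
and going back to Equation (2.1) it follows that

$$\mathbb{P}(G_n \text{ is disconnected and } I_n = 0) \to 0.$$

This proves statement (b) of Theorem 2.2.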


2.2 Other Random Graph Models 


2.2.1 Configuration Model 


In this section, we aim to construct a random graph $G_n$ fitting a given degree
sequence $d = (d_1, \dots, d_n)$. This means that the graph $G_n$ should have $n$ nodes,
and the edges are drawn such that node $i$ has degree $d_i$. Let us make a few
remarks.


• We can suppose $d_i \ge 1$, as $d_i = 0$ means that node $i$ is isolated.

• It is not obvious that there exists a graph verifying the degree requirement.
In fact, such a graph does not necessarily exist. For example, if we assume
that the graph is unweighted, then $\sum_{i=1}^{n} d_i$ should be even (since this sum
corresponds to two times the number of edges).

• Even if we assume that $\sum_{i=1}^{n} d_i$ is even, the construction of such a graph
might not always be possible. To avoid those issues, we will allow self-loops
and multi-edges.


Definition 2.2 (Configuration model). Let $d = (d_1, \dots, d_n)$ be a sequence such
that $\sum_{i=1}^{n} d_i$ is even. At each node $i \in \{1, \dots, n\}$, we attach $d_i$ half-edges (also
called stubs). We then match the stubs in pairs, uniformly at random. The
resulting graph is called the configuration model with degree sequence $d$, abbreviated as
$\mathrm{CM}_n(d)$.


This model allows multi-edges and self-loops. Moreover, by convention a self-loop
counts for two in the degree of a node, since it comes from two stubs.
Algorithm 3 generates a configuration model graph.


Algorithm 3: Generation of a configuration model graph.
Input: degree sequence (d_1, ..., d_n).
Output: list of edges E.

Process:
if d_1 + ... + d_n is odd then
    return an error.
else
    E ← ∅;
    L ← ∅;
    for i = 1 to n do
        for k = 1 to d_i do
            add i to the list L.
    shuffle the elements of L;
    j ← 0;
    while j < |L| do
        add the edge (L[j], L[j + 1]) to E;
        j ← j + 2.

Return: E.
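Algorithm 3 translates almost line by line into Python. The sketch below is illustrative: the function name `configuration_model` and the `seed` argument are our own assumptions, and nodes are labelled 1, ..., n as in the text.

```python
import random


def configuration_model(degrees, seed=None):
    """Pair stubs uniformly at random; self-loops and multi-edges are allowed."""
    if sum(degrees) % 2 != 0:
        raise ValueError("the sum of the degrees must be even")
    rng = random.Random(seed)
    # Node i contributes d_i stubs (half-edges) to the list.
    stubs = [i for i, d in enumerate(degrees, start=1) for _ in range(d)]
    rng.shuffle(stubs)
    # Consecutive entries of the shuffled list form the edges.
    return [(stubs[j], stubs[j + 1]) for j in range(0, len(stubs), 2)]
```

Since a self-loop `(i, i)` consumes two stubs of node `i`, counting edge endpoints recovers exactly the prescribed degree sequence.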




Figure 2.4. $(n, d)$-random regular graphs for $n = 100$ and various $d$. Panels: (a) $d = 2$; (b) $d = 3$; (c) $d = 4$.


Figure 2.5. Configuration model with $n = 100$, where the degrees $d_i$ are sampled independently
from a Zipfian distribution with exponent $\alpha$. Panels: (a) $\alpha = 1.5$; (b) $\alpha = 2$; (c) $\alpha = 2.5$.


Example 2.4. If $d_1 = \dots = d_n = d$, then we obtain a random $(n, d)$-regular graph
(i.e., a random graph with $n$ nodes where all the nodes have the same degree $d$).
We plot some examples in Figure 2.4.


Example 2.5. A random variable $X$ follows a Zipfian distribution with parameters
$n$ and $s$ if $X \in \{1, \dots, n\}$ almost surely and $\mathbb{P}(X = k) = C^{-1} k^{-s}$ for $k \in [n]$,
where $C = \sum_{k=1}^{n} k^{-s}$ is a normalisation constant. In Figure 2.5 we plot some
graphs drawn from the configuration model, where the $d_i$ are sampled from Zipfian
distributions.


2.2.2 Preferential Attachment Model 


Motivation 


Previous models are static, in the sense that the number of nodes is fixed. Moreover,
they do not explain how interesting properties (heavy-tailed degree distribution,
etc.) arise in real graphs. This section provides an example of growing random
graphs, where nodes and edges are added over time.

A first possibility is to construct a graph sequence $(G_n)_{n \in \mathbb{N}}$ such that each $G_n$ is
an Erdős-Rényi graph $G(n, p)$. The graph $G_{n+1}$ would be constructed from $G_n$ as
follows. The edges in $G_n$ are copied to $G_{n+1}$, while edges of the form $(i, n + 1)$ (for
$i = 1, \dots, n$) are added independently with probability $p$. The new graph $G_{n+1}$
is thus a $G_{n+1,p}$, and $G_n$ is a sub-graph of $G_{n+1}$. The problem is that the degree
distribution is binomial, hence does not fit what we observe in most real networks.
The preferential attachment paradigm offers an intuitive explanation behind the
power law degree distribution that we seem to observe in reality. In this paradigm,
a new node $n + 1$ will be connected to the $n$ existing nodes by some additional
edges. These new edges $(i, n + 1)$ are drawn independently with a probability
proportional to the degree of the vertex $i$ at that time. Thus, the new node $n + 1$ is
more likely to be connected to a node with a large degree.


Definition 2.3 (Preferential attachment, informal definition). At time $t$, a new
node will be connected to an old node $i$ with a probability proportional to the
degree $d_i(t)$ of the old node (at time $t$).


With that definition, we can make the following remarks: 


• the old nodes will tend to have higher degrees than the new ones;
• the rich-get-richer phenomenon: new nodes tend to be attached to high-degree
old nodes. In particular, we expect the formation of hubs.


The presence of hubs suggests that the degree distribution
will not be binomial, but may instead exhibit a power law. We will establish
this in Proposition 2.3, just after giving a careful definition of the model.


Remark 2.3. The term preferential attachment comes from Barabási and Albert,
1999, who proposed a similar model, albeit not rigorously defined. Their model
was actually close to the older works by Yule, 1925 and Solla Price, 1976. For a
fully rigorous treatment, we refer to Bollobás et al., 2001 and Hofstad, 2016.


Model definition 


Definition 2.4. A sequence of graphs $\{G_t = (V_t, E_t), t \in \mathbb{N}\}$ is said to be drawn
from the Preferential Attachment Model if:


• $|V_1| = 1$ and $|E_1| = 1$: at time step $t = 1$, we have one node $v_1$ with a single
self-loop;

• at time step $t + 1$, we add the node $v_{t+1}$ to the graph. This node will be linked
to one (and only one) node. The probability that the new node is connected
to node $v_i$ is given by

$$\mathbb{P}\big( (v_{t+1}, v_i) \in E_{t+1} \mid G_t \big) = \begin{cases} \frac{1}{2t + 1} & \text{if } v_i = v_{t+1}, \\ \frac{d_i(t)}{2t + 1} & \text{otherwise,} \end{cases} \qquad (2.2)$$

where $d_i(t)$ is the degree of node $v_i$ at time $t$ (recall that by convention, a
self-loop increases the degree by 2).


We present in Figure 2.6 some graphs drawn from the preferential attachment 
model. 
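A possible way to simulate Definition 2.4 is sketched below. It relies on a standard trick that is our own choice here, not something taken from the text: keeping a list of all edge endpoints, so that a uniform draw from this list selects node $v_i$ with probability proportional to $d_i(t)$. The names `preferential_attachment` and `degree_sequence` and the `seed` argument are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(T, seed=None):
    """Edge list after T >= 1 time steps of the model of Definition 2.4 (nodes 1..T)."""
    rng = random.Random(seed)
    edges = [(1, 1)]    # time 1: node 1 with a single self-loop
    endpoints = [1, 1]  # node i appears d_i(t) times in this list
    for t in range(1, T):
        new = t + 1
        # 2t + 1 equally likely outcomes: 2t existing endpoints, plus one
        # extra outcome for the self-loop of the new node.
        r = rng.randrange(2 * t + 1)
        target = new if r == 2 * t else endpoints[r]
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


def degree_sequence(edges):
    """Degree of each node, a self-loop counting for two."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

At time $t$ the endpoint list has length $2t$, so each outcome has probability $1/(2t + 1)$, which matches Equation (2.2).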
Figure 2.6. Graphs drawn from the preferential attachment model, for various $T$. Panels: (a) $T = 10$; (b) $T = 100$; (c) $T = 500$.


Figure 2.7. Degree distribution of a graph drawn from the preferential attachment model,
with $T = 10^4$. Left: normal scale. Right: log-log scale. The orange line represents the
curve $y = -3x + 11.9$ obtained by linear regression fitting.


Lemma 2.5. After $t$ time steps, the preferential attachment model results in a network
with $|V_t| = t$ nodes and $|E_t| = t$ edges. In particular, Equation (2.2) defines a
probability distribution.


Proof. Indeed, at each time step, we add one node, so $|V_t| = t$. Moreover, we
add only one edge per time step, so $|E_t| = t$. Lastly, since $\sum_{i=1}^{t} d_i(t) = 2|E_t| = 2t$, we have

$$\sum_{i=1}^{t+1} \mathbb{P}\big( (v_{t+1}, v_i) \in E_{t+1} \mid G_t \big) = \frac{1}{2t+1} + \sum_{i=1}^{t} \frac{d_i(t)}{2t+1} = 1.$$


Remark 2.4. A more general version of the preferential attachment model is
described in Hofstad, 2016. Definition 2.4 corresponds to the case $m = 1$ and
$\delta = 0$ there.


Degree distribution of the preferential attachment model 


Let us now investigate the degree distribution of a graph drawn from the preferential
attachment model. Figure 2.7 shows the histogram of the degrees. In particular, we
see that in a log-log scale the curve seems to be linear. Let $N_k$ be the number of
nodes having degree $k$. Figure 2.7(b) seems to indicate that $\log N_k = -\alpha \log k + C$
where $\alpha = 3$ and $C$ is a constant. This in turn implies $N_k \propto k^{-3}$, i.e., the
degree distribution follows a power law with exponent 3. This is indeed proved in
Proposition 2.3.
Proposition 2.3. When $t \to +\infty$, the preferential attachment model exhibits a power
law degree distribution with exponent 3.


Proof. Let $s \in \{1, \dots, t\}$, and denote by $p(k, s, t)$ the probability that the vertex $v_s$
has degree $k$ at time $t$. The evolution of $p(k, s, t)$ is described by the master equation

$$p(k, s, t+1) = \frac{k-1}{2t+1} \, p(k-1, s, t) + \Big( 1 - \frac{k}{2t+1} \Big) p(k, s, t), \qquad (2.3)$$

with the initial condition $p(k, 1, 1) = \delta_{k,2}$ and the boundary condition $p(k, t, t) = \delta_{k,1}$
(neglecting the small probability that a new node attaches to itself). The term $\frac{k-1}{2t+1}$
represents the probability that the new node $v_{t+1}$ is linked to
node $v_s$ at time $t + 1$ when $v_s$ has degree $k - 1$ (thus increasing the degree of $v_s$ by 1), and $\big(1 - \frac{k}{2t+1}\big)$ is the
probability that the new node $v_{t+1}$ is not linked to node $v_s$ when $v_s$ has degree $k$.

Let $P(k, t)$ denote the total degree distribution of the entire network, that is the
average of $p(k, s, t)$ over all nodes $v_s$, $s \in [t]$, present at time $t$. We have

$$P(k, t) = \frac{1}{t} \sum_{s=1}^{t} p(k, s, t).$$

Summing equation (2.3) over $s$ and using the boundary condition yields

$$(t+1) P(k, t+1) = \frac{k-1}{2t+1} \, t P(k-1, t) + \Big( 1 - \frac{k}{2t+1} \Big) t P(k, t) + \delta_{k,1}.$$

Therefore, the time evolution of $P(k, t)$ can be written as

$$(t+1) P(k, t+1) - t P(k, t) = \frac{t}{2t+1} \big[ (k-1) P(k-1, t) - k P(k, t) \big] + \delta_{k,1}.$$

When $t \to +\infty$, this equation for the stationary distribution reduces to

$$P(k) + \frac{1}{2} \big( k P(k) - (k-1) P(k-1) \big) = \delta_{k,1},$$

where $P(k)$ stands for $\lim_{t \to +\infty} P(k, t)$. This last equation is the discrete version of
the differential equation

$$P(k) + \frac{1}{2} \frac{d \, (k P(k))}{dk} = 0,$$

whose solution is

$$P(k) = C k^{-3},$$

with a normalisation factor $C$ such that $\sum_{k} P(k) = 1$ (i.e., $C^{-1} = \sum_{k=1}^{\infty} k^{-3}$).
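As a complement, the stationary recursion can also be solved exactly, without passing to the differential equation. The following worked computation is a sketch under the same non-rigorous approximations as the proof above; it starts from the stationary equation $P(k) + \frac{1}{2}\big(kP(k) - (k-1)P(k-1)\big) = \delta_{k,1}$ with the convention $P(0) = 0$.

```latex
% k = 1: the right-hand side equals 1
\Big(1 + \tfrac{1}{2}\Big) P(1) = 1
  \quad\Longrightarrow\quad P(1) = \tfrac{2}{3}.

% k >= 2: the right-hand side vanishes
\Big(1 + \tfrac{k}{2}\Big) P(k) = \tfrac{k-1}{2}\, P(k-1)
  \quad\Longrightarrow\quad P(k) = \frac{k-1}{k+2}\, P(k-1).

% Iterating the recursion from P(1)
P(k) = \frac{2}{3} \prod_{j=2}^{k} \frac{j-1}{j+2}
     = \frac{2}{3} \cdot \frac{6\,(k-1)!}{(k+2)!}
     = \frac{4}{k(k+1)(k+2)}.
```

This expression sums to 1 over $k \ge 1$ and behaves like $4 k^{-3}$ for large $k$, which is consistent with the exponent 3 obtained from the differential equation.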
Remark 2.5. The above proof is not totally rigorous, as it involves some approximations
that need rigorous justification. However, it explains well the essence of
the preferential attachment process. We refer to Hofstad, 2016 for a more mathematically
involved (but rigorous) proof, as well as some other deeper results on the
preferential attachment model. In particular, more involved models (with the new
nodes attached to several nodes and/or with several new nodes at each time step)
result in power laws with various exponents.


2.2.3 Spatial Networks: Random Geometric Graphs, etc.


In many situations, nodes are positioned in a metric space (e.g., $\mathbb{R}^2$ or the sphere
$S^2$), and an interaction between two nodes directly depends on how far away the
two nodes are in this space. Examples include base stations in wireless and sensor
networks, in which two devices will be connected if they are not too far from each
other. Moreover, in many networks, nodes possess attributes or features (e.g., gender,
age, a grade, a type, ...) which can also be represented as a position in a metric
space, and influence link formation. For example in social networks, users of similar
age and/or gender are typically more connected.


Definition 2.5. The Spatially Embedded Random Network (SERN) model is
defined as follows. Let $(S, d)$ be a metric space, and $(X_1, \dots, X_n)$ be a random
vector representing the locations of $n$ nodes in $S$. Let $y : \mathbb{R}^+ \to [0, 1]$ be a connectivity
function. Then, for every node pair $(i, j)$, we draw an undirected edge between
$i$ and $j$ with probability $y(d(X_i, X_j))$, where $d(X_i, X_j)$ denotes the distance between
nodes $i$ and $j$.


Example 2.6. In the Random Geometric Graph model (RGG) it is assumed that
$X_1, \dots, X_n$ are i.i.d. and uniformly distributed in $S$, while $y(x) = 1(x < r)$. In
other words, two nodes are connected if and only if the distance between them is
less than some threshold $r$.


We plot in Figure 2.8 some examples of RGG. We notice that when $n$ is large,
the network is composed of a few densely connected parts, with empty regions
between them. Moreover, the graph is not small-world, as it takes a lot of edges to
join two nodes that are far away from each other.


Example 2.7. The Waxman model is a SERN where the $X_i$ are uniformly distributed in
$S$, and $y(x) = \min(1, q e^{-ax})$ where $q, a > 0$ are some parameters.


Figure 2.9 shows some realisations of Waxman graphs, and we observe different 
behaviour than that of RGG. In particular, Waxman graphs look like small-world 
networks. Indeed, and in contrast to random geometric graphs, nodes that are far 
away can still be connected with a small but non-zero probability. 
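Both examples fit in one generic sampler. The sketch below is an illustration under assumptions of ours: the space is taken to be $S = [0, 1]^2$ with the Euclidean distance (as in the figures), and the names `sern`, `rgg_connectivity`, `waxman_connectivity` and the `seed` argument are hypothetical.

```python
import math
import random


def sern(n, connectivity, seed=None):
    """Sample a SERN on [0, 1]^2 with i.i.d. uniform positions (nodes 0..n-1)."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])  # Euclidean distance
            # The edge (i, j) is drawn with probability connectivity(dist).
            if rng.random() < connectivity(dist):
                edges.append((i, j))
    return points, edges


def rgg_connectivity(r):
    """Connectivity function of the RGG: 1 if x < r, else 0."""
    return lambda x: 1.0 if x < r else 0.0


def waxman_connectivity(q, a):
    """Connectivity function of the Waxman model: min(1, q * exp(-a * x))."""
    return lambda x: min(1.0, q * math.exp(-a * x))
```

For instance, `sern(100, rgg_connectivity(0.1))` and `sern(100, waxman_connectivity(0.1, 5.0))` use the parameter values quoted in the captions of Figures 2.8 and 2.9.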
Figure 2.8. Example of RGG, when $S = [0, 1]^2$ and $r = 0.1$, for different $n$. Panels: (a) $n = 100$; (b) $n = 200$; (c) $n = 400$.


Figure 2.9. Examples of Waxman graphs, when $S = [0, 1]^2$, $q = 0.1$ and $a = 5$, for different
$n$. Panels: (a) $n = 100$; (b) $n = 200$; (c) the connectivity function $y(x) = \min(1, 0.1 e^{-5x})$.


Figure 2.10. Examples of Waxman graphs, when $S = [0, 1]^2$, $a = 100$ and $q = 2200$, for
different $n$. $q$ is chosen to be approximately equal to $e^{a \times 0.1}$, hence making the graphs
look like RGG with cutoff $r = 0.1$. Figure (c) shows the connectivity function $y(x) = q e^{-ax}$,
which indeed looks like $x \mapsto 1(x < r)$.



Finally, an RGG with threshold $r$ can be expressed as a limit of a Waxman model,
when $a \to \infty$ and $q = e^{ar}$ (see Figure 2.10).

In spatial networks as defined in Definition 2.5, while the random variables
$(A_{ij})_{i<j}$ are still pairwise independent, they are in general no longer mutually independent.
Indeed, the presence of an edge between $i$ and $j$ and between $j$ and $k$
influences the probability of an edge between $i$ and $k$. The simplest example is
to consider a random geometric graph. Knowing that $A_{ij} = A_{jk} = 1$ implies
$d(X_i, X_j) < r$ and $d(X_j, X_k) < r$, and therefore the triangle inequality implies
$d(X_i, X_k) < 2r$, i.e., node $k$ cannot be arbitrarily far away from node $i$, increasing
the likelihood of having an edge between $i$ and $k$.




Table 2.1. Basic properties verified by the models presented in this chapter. Note that
many of those properties only hold under specific conditions (see e.g., Theorems 2.1
and 2.2), and this table is only for a rough summary purpose.


                                   Erdős-Rényi   CM    PA    RGG
Connectivity / Giant component          ✓         ✓     ✓     ✓
Small world                             ✓         ✓     ✓     ✗
Power law degree distribution           ✗         ✓     ✓     ✗
Edge transitivity                       ✗         ✗     ✗     ✓


2.2.4 Summary 


We summarise in Table 2.1 the basic properties verified by the random graph mod- 
els presented in this chapter. 


2.3 Clustered Random Graphs: Block Models 


This section is devoted to clustered random graph models. This refers to situations 
in which each node has a community attribute, and these community attributes 
influence the probability of interaction. The block model paradigm considers that 
nodes are placed into communities (called blocks) and that the probability of a link
between $i$ and $j$ depends on the community labels of $i$ and $j$ (and possibly on
some extra features of $i$ and $j$, such as their spatial position).


2.3.1 Stochastic Block Model


The Stochastic Block Model is the simplest and most studied clustered random graph.
It is a direct extension of the Erdős-Rényi model.


Definition 2.6. Let $n$ be the number of nodes, $K$ be the number of communities,
$\pi = (\pi_1, \dots, \pi_K)$ be a probability vector, and $P = (p_{k\ell})$ be a $K \times K$ symmetric matrix
whose entries are in $[0, 1]$. The pair $(z, G)$ is drawn under the Stochastic Block Model
(SBM) with parameters $(n, \pi, P)$ if:


• $z \in [K]^n$ is a random vector whose entries are independent and identically
distributed such that $\mathbb{P}(z_i = k) = \pi_k$;

• $G$ is an undirected graph with $n$ nodes, where the nodes $i$ and $j$ are connected
with probability $p_{z_i z_j}$, independently of other pairs of nodes.


We write $(z, G) \sim \mathrm{SBM}(n, \pi, P)$.


Figure 2.11 gives some examples of graphs drawn from the SBM. 
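Definition 2.6 can be sampled directly, as in the sketch below. It is an illustration only: the function name `sample_sbm`, the `seed` argument and the 0-based labels (communities 0..K-1, nodes 0..n-1) are our own assumptions. It tests every node pair, so it is the SBM analogue of Algorithm 1 rather than an efficient sampler.

```python
import random


def sample_sbm(n, pi, P, seed=None):
    """Draw (z, edges) from SBM(n, pi, P), with P given as a K x K list of lists."""
    rng = random.Random(seed)
    K = len(pi)
    # i.i.d. community labels with P(z_i = k) = pi[k]
    z = rng.choices(range(K), weights=pi, k=n)
    # Each pair (i, j), i < j, is linked independently with probability P[z_i][z_j].
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < P[z[i]][z[j]]]
    return z, edges
```

For example, `sample_sbm(400, [0.5, 0.5], [[0.05, 0.005], [0.005, 0.05]])` uses the connectivity probabilities quoted in the caption of Figure 2.11 with two communities of random sizes.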
Figure 2.11. Different SBMs, with 200 nodes per community and connectivity probabilities
$q_{kk} = 0.05$ and $q_{k\ell} = 0.005$ for $k \ne \ell$. Panels: (a) $K = 2$; (b) $K = 3$; (c) $K = 4$.


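Proposition 2.4. For $z \in [K]^n$ and $k \in [K]$, we denote by $C_k^z = \{i \in [n] : z_i = k\}$
the community sets given by the node-labelling $z$. Let $(z, G) \sim \mathrm{SBM}(n, \pi, P)$. Then,

$$\mathbb{P}(z) = \prod_{k=1}^{K} \pi_k^{|C_k^z|},$$

$$\mathbb{P}(G \mid z) = \prod_{1 \le i < j \le n} p_{z_i z_j}^{A_{ij}} (1 - p_{z_i z_j})^{1 - A_{ij}} \qquad (2.4)$$

$$\phantom{\mathbb{P}(G \mid z)} = \prod_{1 \le k \le \ell \le K} p_{k\ell}^{N_{k\ell}(1)} (1 - p_{k\ell})^{N_{k\ell}(0)}, \qquad (2.5)$$

where $N_{k\ell}(a) = \sum_{1 \le i < j \le n} 1(A_{ij} = a) \, 1(\{z_i, z_j\} = \{k, \ell\})$ is the number of edges (if
$a = 1$) or non-edges (if $a = 0$) between the communities $k$ and $\ell$.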


Proof. By independence of the node community labels, we have

$$\mathbb{P}(z) = \prod_{i=1}^{n} \pi_{z_i} = \prod_{k=1}^{K} \pi_k^{|C_k^z|}.$$

Then, equation (2.4) is a consequence of Proposition 2.1.


Remark 2.6. The adjacency matrix of $\mathrm{SBM}(n, \pi, P)$ can be seen as a block matrix,
whose blocks are Erdős-Rényi graphs. This is in particular useful to efficiently simulate
sparse SBMs (see Algorithm 2, or the NetworkX or igraph implementations).


Definition 2.7. We call homogeneous (or symmetric) SBM an SBM such that

$$p_{z_i z_j} = \begin{cases} p_{\mathrm{in}} & \text{if } z_i = z_j, \\ p_{\mathrm{out}} & \text{otherwise.} \end{cases}$$
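Proposition 2.5. Let $(z, G)$ be sampled from a homogeneous SBM. Then,

$$\mathbb{P}(G \mid z) = \left( \frac{p_{\mathrm{in}}}{1 - p_{\mathrm{in}}} \right)^{|E|} (1 - p_{\mathrm{in}})^{\frac{n(n-1)}{2}} \times \left( \frac{1 - p_{\mathrm{out}}}{1 - p_{\mathrm{in}}} \right)^{\sum_{1 \le k < \ell \le K} |C_k^z| |C_\ell^z|} \left( \frac{p_{\mathrm{out}}}{1 - p_{\mathrm{out}}} \cdot \frac{1 - p_{\mathrm{in}}}{p_{\mathrm{in}}} \right)^{N_{\mathrm{out}}^z},$$

where $N_{\mathrm{out}}^z = \sum_{1 \le i < j \le n} 1(A_{ij} = 1) \, 1(z_i \ne z_j)$ is the number of inter-community
edges.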


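Proof. From equation (2.5) we have

$$\mathbb{P}(G \mid z) = \prod_{1 \le k \le \ell \le K} p_{k\ell}^{N_{k\ell}(1)} (1 - p_{k\ell})^{N_{k\ell}(0)}.$$

We notice that $N_{k\ell}(0) + N_{k\ell}(1) = \sum_{i<j} 1(z_i = k)\, 1(z_j = \ell)$. Hence,

$$N_{k\ell}(0) + N_{k\ell}(1) = \begin{cases} |C_k^z| \cdot |C_\ell^z| & \text{if } k \neq \ell, \\ \frac{|C_k^z| \left(|C_k^z| - 1\right)}{2} & \text{otherwise,} \end{cases}$$

and

$$\mathbb{P}(G \mid z) = (1 - p_{\mathrm{in}})^{\sum_{k} \frac{|C_k^z| (|C_k^z| - 1)}{2}} (1 - p_{\mathrm{out}})^{\sum_{k < \ell} |C_k^z|\,|C_\ell^z|} \left(\frac{p_{\mathrm{in}}}{1 - p_{\mathrm{in}}}\right)^{\sum_{k} N_{kk}(1)} \left(\frac{p_{\mathrm{out}}}{1 - p_{\mathrm{out}}}\right)^{\sum_{k < \ell} N_{k\ell}(1)}.$$

Since $\sum_{k=1}^{K} |C_k^z| = n$, then $\sum_{k=1}^{K} |C_k^z|^2 = n^2 - 2 \sum_{1 \le k < \ell \le K} |C_k^z|\,|C_\ell^z|$, so that

$$\mathbb{P}(G \mid z) = (1 - p_{\mathrm{in}})^{\frac{n(n-1)}{2}} \left(\frac{1 - p_{\mathrm{out}}}{1 - p_{\mathrm{in}}}\right)^{\sum_{1 \le k < \ell \le K} |C_k^z|\,|C_\ell^z|} \left(\frac{p_{\mathrm{in}}}{1 - p_{\mathrm{in}}}\right)^{\sum_{k} N_{kk}(1)} \left(\frac{p_{\mathrm{out}}}{1 - p_{\mathrm{out}}}\right)^{\sum_{k < \ell} N_{k\ell}(1)}.$$

Finally, since $N_{\mathrm{out}}^z = \sum_{1 \le k < \ell \le K} N_{k\ell}(1)$ and $|E| = \sum_{1 \le k \le \ell \le K} N_{k\ell}(1)$, we have
$\sum_{k} N_{kk}(1) = |E| - N_{\mathrm{out}}^z$ and the proposition statement holds.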

Proposition 2.6. Let $(z, G) \sim \mathrm{SBM}(n, \pi, P)$ be a homogeneous SBM graph with
uniform node labels, i.e., $\pi = \left(\frac{1}{K}, \dots, \frac{1}{K}\right)$. Then, the expected degree $d$ of any node is

$$d = \left(\frac{n}{K} - 1\right) p_{\mathrm{in}} + \frac{n (K - 1)}{K}\, p_{\mathrm{out}}.$$

Proof. It is similar to the proof of Proposition 2.2. A given node has $\frac{n}{K} - 1$ potential
neighbors in its community, and $\frac{n}{K}(K - 1)$ potential neighbors in the other
communities.


2.3.2 Degree-corrected Stochastic Block Model 


The results established for the Erdős-Rényi model (giant component, connectivity,
etc.) are also valid for the Stochastic Block Model. Moreover, the limitations mentioned
for the Erdős-Rényi graphs also apply to SBMs, in particular the limitation
of the degree distribution. To introduce more heterogeneity in the degree
distribution, Karrer and Newman, 2011 proposed the degree-corrected SBM. In
this model, each node $i$ has, in addition to its community label $z_i$, a degree parameter
$\theta_i$ modelling its popularity (i.e., the propensity of node $i$ to make links). The
formal definition is as follows.


Definition 2.8. Let $n$ be the number of nodes and $K$ be the number of communities,
$\pi = (\pi_1, \dots, \pi_K)$ be a probability vector, and $P$ be a $K \times K$ symmetric
matrix. Furthermore, let $\theta = (\theta_1, \dots, \theta_n) \in \mathbb{R}_+^n$ be the vector of degree-correction
parameters. The pair $(z, G)$ is said to be drawn from the Degree-corrected Stochastic
Block Model (DC-SBM) with parameters $(n, \pi, P, \theta)$ if

• $z = (z_1, \dots, z_n) \in [K]^n$ is a random vector whose entries are independent
and distributed according to $\pi$;

• conditioned on $z$, $G$ is an undirected graph with $n$ nodes, where nodes $i$
and $j$ are connected with probability $\min(\theta_i \theta_j P_{z_i z_j}, 1)$, independently of other
node pairs.


In the following, we will always suppose that $\theta_i \theta_j P_{z_i z_j} \le 1$ for every $(i, j)$. From
Definition 2.8, we notice that multiplying all the $\theta_i$ for $i$ such that $z_i = k$ by
a constant $c$, and dividing $P_{k\ell}$ by $c$ if $k \neq \ell$ and $P_{kk}$ by $c^2$, leads to the same
model. Therefore, after sampling the community labelling $z$, we normalise the $\theta_i$'s
such that $\sum_i \theta_i 1(z_i = k) = n \pi_k$, where $n \pi_k$ is the expected number of nodes in
block $k$. With this normalisation choice, we recover the SBM if $\theta_i = 1$
for all $i$. Moreover, the parameter $\theta_i$ can be interpreted as the relative importance
of node $i$ in the graph. Another widely used normalisation consists in imposing


Leah SH 1, 

Similarly to the SBM, we define a homogeneous DC-SBM when the entries of the
matrix $P$ take only two values: $P_{kk} = p_{\mathrm{in}}$ and $P_{k\ell} = p_{\mathrm{out}}$ for $k \neq \ell$.


Proposition 2.7. Consider a pair $(z, G)$ drawn under a homogeneous DC-SBM
$(n, \pi, P, \theta)$. Let $i \in [n]$ be a node in community $k$. The expected degree of $i$ is given by

$$\mathbb{E}\, d_i = \theta_i\, n \sum_{\ell=1}^{K} \pi_\ell P_{k\ell}.$$

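Proof. Let $A$ be the adjacency matrix of $G$. We have $d_i = \sum_{j=1}^{n} A_{ij}$, where the
$A_{ij}$ ($j = 1, \dots, n$) are independent random variables distributed as $\mathrm{Ber}(\theta_i \theta_j P_{z_i z_j})$.
Hence, conditioning on $z$ gives

$$\mathbb{E}[d_i \mid z] = \sum_{j=1}^{n} \theta_i \theta_j P_{z_i z_j} = \theta_i \sum_{\ell=1}^{K} \left(\sum_{j=1}^{n} \theta_j 1(z_j = \ell)\right) P_{k\ell}.$$

The result follows using the normalisation $\sum_{j=1}^{n} 1(z_j = \ell)\, \theta_j = n \pi_\ell$.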


To make some computations easier, it is sometimes convenient to define a Poisson
Degree-Corrected Block Model. This refers to a random graph $G$ with Poisson
distributed edges. More precisely, $A_{ii} = 0$, and for $i \neq j$ we have

$$A_{ij} = A_{ji} \sim \mathrm{Poi}(\theta_i \theta_j \omega_{z_i z_j}), \tag{2.6}$$

where $\mathrm{Poi}(\lambda)$ denotes a Poisson random variable with parameter $\lambda$, whose probability
mass function is given by

$$\mathbb{P}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

The $\theta_i$'s are the degree-correction parameters and $\omega_{k\ell}$ is the edge density between
blocks $k$ and $\ell$. Note that $A_{ij}$ is then an integer-valued random variable. Similarly
to the DC-SBM, we assume that for all $k \in [K]$, $\sum_i \theta_i 1(z_i = k) = n_k$. When all
$\theta_i$'s are equal to one, we recover a Poisson version of the SBM, namely

$$A_{ij} = A_{ji} \sim \mathrm{Poi}(\omega_{z_i z_j}). \tag{2.7}$$


While these random graph models differ from the standard SBM and degree-corrected
SBM by allowing integer-valued edges, we notice that when $\theta_i \theta_j \omega_{z_i z_j} \ll 1$

the Poisson distribution is close to the Bernoulli distribution with the same parameter
$\theta_i \theta_j \omega_{z_i z_j}$, hence making the two models similar in practice. The Poisson framework
is interesting as it makes some computations easier. In particular, for the Poisson
version we have


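$$\mathbb{P}(A \mid z, \theta, \omega) = \frac{\prod_{i} \theta_i^{d_i}}{\prod_{i<j} A_{ij}!} \prod_{1 \le k \le K} \omega_{kk}^{m_{kk}/2}\, e^{-\frac{n_k^2}{2} \omega_{kk}} \prod_{1 \le k < \ell \le K} \omega_{k\ell}^{m_{k\ell}}\, e^{-n_k n_\ell \omega_{k\ell}}, \tag{2.8}$$

where $d_i = \sum_j A_{ij}$ is the degree of node $i$, $n_k = \sum_i 1(z_i = k)$ is the number of nodes in
block $k$, and $m_{k\ell} = \sum_{i,j} A_{ij} 1(z_i = k)\, 1(z_j = \ell)$ denotes the number of edges going from community
$k$ to community $\ell$ (or twice the number if $k = \ell$).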


2.3.3 Popularity Adjusted Block Model


While the DC-SBM allows us to accurately fit the degree distribution by enforcing
the node degree parameters, it forces a popular node to be popular among
all communities. Indeed, if $\theta_i$ is large, then node $i$ is expected to have many
friends in every community. The Popularity Adjusted Block Model (PABM)
bypasses this restriction by allowing node popularity to vary across both nodes
and communities.


Definition 2.9. Let $n$ be the number of nodes, $K$ be the number of communities,
and $z \in [K]^n$ be a node-labelling vector. Let $\Lambda = (\lambda_{ik})_{i \in [n], k \in [K]} \in [0, 1]^{n \times K}$. A
graph $G = (V, E)$ is drawn from the Popularity Adjusted Block Model if $V = [n]$;
the edges are formed independently; and

$$\mathbb{P}((i, j) \in E) = \lambda_{i z_j} \lambda_{j z_i}.$$

In other words, $\lambda_{ik}$ is the propensity of node $i$ to form links with a node in
community $k$.


Example 2.8. We recover the SBM by letting $\lambda_{ik} = \sqrt{P_{z_i k}}$ for every $i \in [n]$ and
every $k \in [K]$.


Example 2.9. We recover the DC-SBM by letting $\lambda_{ik} = \theta_i \sqrt{P_{z_i k}}$ for every $i \in [n]$
and every $k \in [K]$.

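
The two reductions can be checked numerically. The sketch below assumes the square-root parametrisation $\lambda_{ik} = \sqrt{P_{z_i k}}$ (respectively $\theta_i \sqrt{P_{z_i k}}$), so that the product $\lambda_{i z_j} \lambda_{j z_i}$ gives back $P_{z_i z_j}$ (respectively $\theta_i \theta_j P_{z_i z_j}$) when $P$ is symmetric. The labels, block matrix and degree parameters are hypothetical values chosen for illustration.

```python
import math

def pabm_edge_prob(lam, z, i, j):
    """P((i, j) in E) = lambda_{i, z_j} * lambda_{j, z_i}."""
    return lam[i][z[j]] * lam[j][z[i]]

z = [0, 0, 1, 1]                      # hypothetical labels, 0-based
P = [[0.6, 0.1], [0.1, 0.4]]          # hypothetical symmetric block matrix
theta = [1.2, 0.8, 0.5, 1.5]          # hypothetical degree parameters
n, K = len(z), len(P)

# SBM case: lambda_ik = sqrt(P_{z_i k})
lam_sbm = [[math.sqrt(P[z[i]][k]) for k in range(K)] for i in range(n)]
# DC-SBM case: lambda_ik = theta_i * sqrt(P_{z_i k})
lam_dc = [[theta[i] * math.sqrt(P[z[i]][k]) for k in range(K)] for i in range(n)]
```
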
2.3.4 Soft Geometric Block Model 


Similarly to how the SBM extends the Erdős-Rényi model, the Soft Geometric Block 
Model (SGBM) extends the soft geometric random graphs or SERNs. 




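Definition 2.10. Let $(S, d)$ be a metric space, and $(\gamma_{k\ell})_{1 \le k, \ell \le K} : \mathbb{R}_+ \to [0, 1]$ be
a set of connectivity functions, with $\gamma_{k\ell} = \gamma_{\ell k}$. We assign to each node a position
$X_i \in S$ and a community label $\sigma_i \in [K]$. Then,

$$\mathbb{P}(A \mid X, \sigma) = \prod_{i<j} \gamma_{\sigma_i \sigma_j}\big(d(X_i, X_j)\big)^{A_{ij}} \Big(1 - \gamma_{\sigma_i \sigma_j}\big(d(X_i, X_j)\big)\Big)^{1 - A_{ij}}.$$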
i<j 


This model supposes that two nodes $i, j$ are connected with a probability that
depends both on their positions and on their community assignments.


Example 2.10. We recover the SBM by further restricting $\gamma_{k\ell}(x) = q_{k\ell}$ to be
constants (for all $k, \ell$).


Example 2.11. The Geometric Block Model (GBM) restricts $\gamma_{k\ell}(x) = 1(x \le r_{k\ell})$
for some parameters $r_{k\ell} > 0$.


Finally, the model is homogeneous if

$$\gamma_{k\ell} = \begin{cases} \gamma_{\mathrm{in}} & \text{if } k = \ell, \\ \gamma_{\mathrm{out}} & \text{otherwise.} \end{cases}$$


2.4 Exponential Random Graph Model 


2.4.1 Definition and First Examples


Exponential Random Graph Model (ERGM) provides a convenient framework to 
explain the different network statistics observed in various networks. Examples of 
network statistics include the degree heterogeneity, the transitivity of relationships 
(friends of friends tend to be friends), the homophily (the propensity to link with 
nodes sharing the same attribute), the reciprocity of ties (in directed networks), etc. 


Definition 2.11. Let $n$ be the number of nodes, $\theta = (\theta_1, \dots, \theta_q) \in \mathbb{R}^q$ be a vector
of parameters, and $g = (g_1, \dots, g_q)$ be a vector of network statistics. The adjacency
matrix of an ERGM has the following probability distribution:

$$\mathbb{P}(A \mid \theta) = \frac{\exp\left(\theta^T g(A)\right)}{\kappa(\theta)},$$

where $\kappa(\theta)$ is a normalisation constant.
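
Example 2.12. Consider the Bernoulli random graph model, where the $A_{ij}$ are independent,
with $A_{ij} = A_{ji} \sim \mathrm{Ber}(p_{ij})$. Then

$$\mathbb{P}(A) = \prod_{i<j} p_{ij}^{A_{ij}} (1 - p_{ij})^{1 - A_{ij}} = \frac{\exp\left(\sum_{i<j} \theta_{ij} A_{ij}\right)}{\kappa(\theta)} = \frac{\exp\left(\theta^T g(A)\right)}{\kappa(\theta)},$$

where $\theta_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}$, $\kappa(\theta) = \left(\prod_{i<j} (1 - p_{ij})\right)^{-1}$, $\theta = (\theta_{ij})_{1 \le i < j \le n}$, and
$g(A) = (A_{ij})_{1 \le i < j \le n}$ (we have $q = \binom{n}{2} = \frac{n(n-1)}{2}$ network statistics).

We notice in this example that $\theta_{ij} = \log \frac{p_{ij}}{1 - p_{ij}} = \mathrm{logit}\, \mathbb{P}(A_{ij} = 1)$, where
$\mathrm{logit}(x) = \log \frac{x}{1 - x}$. We will observe several such relationships in the later
examples.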

As the Erdős-Rényi graph, the SBM and the DC-SBM are particular cases of the
Bernoulli random graph, they can also be expressed as ERGMs. For example, for an
Erdős-Rényi random graph $G_{n,p}$, the parameter $\theta_{ij} = \mathrm{logit}(p)$ is independent of $i$ and $j$,
and thus the previous example reduces to

$$\mathbb{P}(A) = \frac{\exp\left(\theta |E|\right)}{\kappa(\theta)},$$

where $\theta = \log \frac{p}{1 - p}$, $g(A) = \sum_{i<j} A_{ij} = |E|$ is the number of edges and
$\kappa(\theta) = (1 - p)^{-\frac{n(n-1)}{2}}$.
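
The Erdős-Rényi case is small enough to verify directly. The sketch below compares the exponential form $\exp(\theta |E|) / \kappa(\theta)$ with the usual product $p^{|E|} (1 - p)^{n(n-1)/2 - |E|}$ and checks that the probabilities of all graphs sum to one. The values of `n` and `p` are arbitrary choices for this illustration.

```python
import itertools
import math

n, p = 4, 0.3                       # hypothetical small example
pairs = list(itertools.combinations(range(n), 2))
theta = math.log(p / (1 - p))       # theta = logit(p)
kappa = (1 - p) ** (-len(pairs))    # kappa(theta) = (1 - p)^{-n(n-1)/2}

def ergm_prob(num_edges):
    """exp(theta * |E|) / kappa(theta)."""
    return math.exp(theta * num_edges) / kappa

def bernoulli_prob(num_edges):
    """p^{|E|} (1 - p)^{n(n-1)/2 - |E|}."""
    return p ** num_edges * (1 - p) ** (len(pairs) - num_edges)

# Sum over all 2^{n(n-1)/2} graphs, grouped by their number of edges
total = sum(math.comb(len(pairs), e) * ergm_prob(e) for e in range(len(pairs) + 1))
```
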


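2.4.2 The p1 Model


Now consider a directed graph $A$ and let $X_{ij} = (A_{ij}, A_{ji})$. Assume that $(X_{ij})_{i<j}$ are
independent, and define

$$\mathbb{P}(X_{ij} = (1, 1)) = r_{ij}, \qquad \mathbb{P}(X_{ij} = (1, 0)) = s_{ij}, \qquad \mathbb{P}(X_{ij} = (0, 0)) = t_{ij}.$$

Note that $r_{ij} = r_{ji}$, $t_{ij} = t_{ji}$ and $r_{ij} + s_{ij} + s_{ji} + t_{ij} = 1$. Moreover,

$$\mathbb{P}(A) = \prod_{i<j} r_{ij}^{A_{ij} A_{ji}} \prod_{i \neq j} s_{ij}^{A_{ij} (1 - A_{ji})} \prod_{i<j} t_{ij}^{(1 - A_{ij})(1 - A_{ji})}.$$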
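
This can be re-expressed in an exponential form, as

$$\mathbb{P}(A) = \exp\Big(\sum_{i<j} \rho_{ij} A_{ij} A_{ji} + \sum_{i \neq j} \mu_{ij} A_{ij}\Big) \prod_{i<j} t_{ij},$$

where $\rho_{ij} = \log\left(\frac{r_{ij} t_{ij}}{s_{ij} s_{ji}}\right)$ and $\mu_{ij} = \log\left(\frac{s_{ij}}{t_{ij}}\right)$. We notice that

$$\mu_{ij} = \log \frac{\mathbb{P}(A_{ij} = 1 \mid A_{ji} = 0)}{\mathbb{P}(A_{ij} = 0 \mid A_{ji} = 0)} = \mathrm{logit}\left(\mathbb{P}(A_{ij} = 1 \mid A_{ji} = 0)\right)$$

measures the probability of an asymmetric link between $i$ and $j$. Similarly,

$$\rho_{ij} = \log \frac{\mathbb{P}(A_{ij} = 1 \mid A_{ji} = 1)}{\mathbb{P}(A_{ij} = 0 \mid A_{ji} = 1)} - \log \frac{\mathbb{P}(A_{ij} = 1 \mid A_{ji} = 0)}{\mathbb{P}(A_{ij} = 0 \mid A_{ji} = 0)} = \mathrm{logit}\left(\mathbb{P}(A_{ij} = 1 \mid A_{ji} = 1)\right) - \mu_{ij}$$

is related to the probability that $A_{ij} = 1$ given that $A_{ji} = 1$, that is the force of
reciprocation between $i$ and $j$.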

The p1-model from (Holland and Leinhardt, 1981) further restricts $\rho_{ij} = \rho$ and
$\mu_{ij} = \mu + \alpha_i + \beta_j$, so that

$$\mathbb{P}(A) = \frac{\exp\left(\rho R + \mu M + \sum_i \alpha_i A_{i+} + \sum_j \beta_j A_{+j}\right)}{\kappa(\rho, \mu, \alpha, \beta)}, \tag{2.9}$$

where $A_{+i} = \sum_j A_{ji}$ denotes the in-degree of node $i$, and $A_{i+} = \sum_j A_{ij}$ the out-degree
of node $i$, $M = \sum_{i,j} A_{ij}$ is the number of edges, and $R = \sum_{i<j} A_{ij} A_{ji}$ is the
number of reciprocated edges. We can interpret equation (2.9) as follows:


• the parameter $\mu$ governs the density of (directed) edges. In particular, if $\rho =
\alpha_i = \beta_i = 0$ while $\mu \neq 0$, then we recover a directed Erdős-Rényi random
graph, with link probability $p$ such that $\mu = \mathrm{logit}\, p$;

• if $\alpha_i$ is large, then node $i$ will tend to form many out-going edges. We can thus
call $\alpha_i$ the productivity of node $i$;

• $\beta_i$ refers to the attractiveness of node $i$, since a large $\beta_i$ will push many nodes
to form edges pointing towards $i$;

• finally, the parameter $\rho$ is the force of reciprocation of ties.
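
To make the role of the four parameters concrete, the following sketch computes the dyad probabilities $(t_{ij}, s_{ij}, s_{ji}, r_{ij})$ implied by the exponential form, using the fact that a dyad with state $(A_{ij}, A_{ji})$ has weight $\exp(\rho A_{ij} A_{ji} + \mu_{ij} A_{ij} + \mu_{ji} A_{ji})$. The function name and the numerical parameter values are assumptions made for illustration.

```python
import math

def p1_dyad_probs(rho, mu, alpha, beta, i, j):
    """Return (t_ij, s_ij, s_ji, r_ij) for the dyad (i, j) of the p1 model."""
    mu_ij = mu + alpha[i] + beta[j]
    mu_ji = mu + alpha[j] + beta[i]
    weights = [1.0,                             # no edge
               math.exp(mu_ij),                 # only i -> j
               math.exp(mu_ji),                 # only j -> i
               math.exp(rho + mu_ij + mu_ji)]   # both edges
    total = sum(weights)
    t, s_ij, s_ji, r = (w / total for w in weights)
    return t, s_ij, s_ji, r

alpha, beta = [0.5, -0.5], [0.2, -0.2]          # hypothetical parameters
t, s_ij, s_ji, r = p1_dyad_probs(rho=1.0, mu=-1.0, alpha=alpha, beta=beta, i=0, j=1)
```

One can then recover $\rho$ and $\mu_{ij}$ from these probabilities through the log-ratios given in the text, which is a convenient sanity check when fitting the model.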

2.4.3 Relationship Between θ and the Log-odds


We noticed in Example 2.12 that $\theta_{ij} = \mathrm{logit}\, \mathbb{P}(A_{ij} = 1)$, and similar relationships
were drawn from the p1-model. The following proposition generalizes this observation to any
ERGM.


Proposition 2.8. Consider an ERGM model as in Definition 2.11. Let $A_{ij}^{+} =
\{A \text{ with } A_{ij} = 1\}$ be the graph with the edge $(i, j)$ set to one, $A_{ij}^{-} = \{A \text{ with } A_{ij} = 0\}$
be the graph with the edge $(i, j)$ set to zero, and $A_{-ij} = \{A_{uv} \text{ with } (u, v) \neq (i, j)\}$ be the
set of all the edges and non-edges except $A_{ij}$. Then, we have

$$\mathrm{logit}\, \mathbb{P}\left(A_{ij} = 1 \mid A_{-ij}\right) = \theta^T \left(g(A_{ij}^{+}) - g(A_{ij}^{-})\right).$$


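Proof. Observe that

$$\mathbb{P}\left(A_{ij} = 1 \mid A_{-ij}\right) = \frac{\mathbb{P}(A_{ij}^{+})}{\mathbb{P}(A_{ij}^{+}) + \mathbb{P}(A_{ij}^{-})} = \frac{\exp\left(\theta^T g(A_{ij}^{+})\right)}{\exp\left(\theta^T g(A_{ij}^{+})\right) + \exp\left(\theta^T g(A_{ij}^{-})\right)}.$$

Similarly,

$$\mathbb{P}\left(A_{ij} = 0 \mid A_{-ij}\right) = \frac{\exp\left(\theta^T g(A_{ij}^{-})\right)}{\exp\left(\theta^T g(A_{ij}^{+})\right) + \exp\left(\theta^T g(A_{ij}^{-})\right)},$$

and therefore

$$\mathrm{logit}\, \mathbb{P}\left(A_{ij} = 1 \mid A_{-ij}\right) = \theta^T \left[g(A_{ij}^{+}) - g(A_{ij}^{-})\right].$$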
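
The proposition can be illustrated on a toy ERGM. The sketch below assumes, for illustration only, an undirected ERGM on four nodes whose statistics are the number of edges and the number of triangles; it computes the conditional probability of one edge from the unnormalised weights and compares its logit with $\theta^T (g(A_{ij}^{+}) - g(A_{ij}^{-}))$. Note that the normalisation constant $\kappa(\theta)$ cancels and is never needed.

```python
import itertools
import math

n = 4
theta = (-0.5, 0.8)                 # hypothetical (edge, triangle) parameters

def stats(edges):
    """g(A) = (number of edges, number of triangles); edges are pairs (a, b), a < b."""
    triangles = sum(1 for a, b, c in itertools.combinations(range(n), 3)
                    if {(a, b), (a, c), (b, c)} <= edges)
    return (len(edges), triangles)

def weight(edges):
    """Unnormalised ERGM probability exp(theta^T g(A))."""
    return math.exp(sum(t * g for t, g in zip(theta, stats(edges))))

def logit(x):
    return math.log(x / (1 - x))

rest = {(0, 2), (1, 2)}             # A_{-ij}: all other edges, for (i, j) = (0, 1)
plus, minus = rest | {(0, 1)}, rest
cond = weight(plus) / (weight(plus) + weight(minus))   # P(A_ij = 1 | A_{-ij})
delta = [gp - gm for gp, gm in zip(stats(plus), stats(minus))]
rhs = sum(t * d for t, d in zip(theta, delta))         # theta^T (g(A+) - g(A-))
```

Here adding the edge $(0, 1)$ creates one edge and closes one triangle, so the log-odds equal the sum of the two parameters.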


Further Notes 


A nice additional reference for this chapter is Barabási, 2016 (an online and interactive
version is available at: http://networksciencebook.com/), as well as (by order
of relevance): Hofstad, 2016; Durrett, 2007; Chung and Lu, 2006. Finally, other
classic books on random graphs (more focused on mathematical proofs) are Janson
et al., 2011; Bollobás, 2001.

Many random graph models exist. A model worth mentioning and not covered
in this chapter is the small-world model (Watts and Strogatz, 1998). A complete
review of the SBM is given in Abbe, 2018. For random geometric graphs, we refer
the reader to Penrose, 2003.

A useful modification of the random geometric graph model with scale-free 
degree distribution is the hyperbolic geometric graph model (see e.g., Krioukov 
et al., 2010). 


DOI: 10.1561/9781638280514.ch3


Network Centrality Indices 


One natural question in network analysis is “which are the most important nodes in 
a network?” A node can be important in several aspects. For instance, in a social net- 
work, a node can be well-connected to many social groups or a node can facilitate 
the information flow in a network. In an information network, a node can provide 
links to important information sources or can be a reference node. In an infrastruc- 
ture network, a node can be crucial for sustaining a good topological structure of a 
network. 

The importance of nodes can be characterized by a real-valued function defined
on network nodes. The values of the function indicate the degree of importance of
the nodes and can be used for ranking purposes. Such functions are called network
centrality indices or network centrality measures.¹ From the previous paragraph, it
is already clear that different importance criteria potentially lead to very different
definitions of centrality indices. Therefore, we first review various existing centrality
indices and then discuss relations among them and several applications. Along with
centrality indices for nodes, there are also centrality indices for network edges. Even
though our main focus will be on centrality indices for nodes, we shall mention
some centrality indices for edges as well. Many centrality indices
have already been proposed and the list continues to grow. We try to give an overview
of the most important and distinctive cases.


1. Even though the term centrality measure is more common in the literature, we prefer to use the term centrality
indices to disambiguate from probability measure.


3.1 Overview of Centrality Indices 


In this section we divide the definitions of various centrality indices into groups. We
admit that the proposed categorization is not the only possible one and that some centrality
indices can be placed in more than one group. We shall try to mention possible
re-classifications.


3.1.1 Distance Based Centrality Indices 


Here we describe network centrality indices based on geodesic (shortest path) dis- 
tances between the nodes. 


Node degree. The simplest distance-based centrality index is the node degree, the
number of immediate neighbours of a node. In the case of directed networks, we
actually have the indegree $d_v^{-}$, corresponding to the number of incoming edges, and
the outdegree $d_v^{+}$, corresponding to the number of outgoing edges. We note that the
nodes with large indegree can be interpreted as “authorities” and the nodes with 
large outdegree can be interpreted as “hubs”. In the context of bibliometrics, the 
indegree of an article is the number of the other articles citing this article, and the 
outdegree is the number of references present in the article. Naturally, an established 
authoritative article has many citations, and a survey article typically references 
many sources. 


Closeness. Denote by $d(v, u)$ the length of a shortest path from node $v$ to node
$u$. Bavelas, 1950 introduced the notion of closeness centrality. The closeness
centrality index of node $u$ can be defined by

$$\frac{n - 1}{\sum_{v} d(v, u)}. \tag{3.1}$$

The closeness centrality is just the reciprocal of the average distance from the given
node to all the other nodes. Originally, the closeness centrality was defined for undirected,
connected networks. While the formal extension to the case of directed networks
is quite straightforward, the absence of (strong) connectivity poses a problem.


Harmonic centrality. To overcome the problem of infinite path lengths in the case
of disconnected or weakly connected networks, the notion of harmonic centrality
was proposed. The idea of harmonic centrality is to swap the inversion and summation
operations (also changing the normalisation), which results in

$$\frac{1}{n - 1} \sum_{v \neq u} \frac{1}{d(v, u)}.$$

Thus, the harmonic centrality is the reciprocal of the harmonic mean distance.
Thanks to the convention $\infty^{-1} = 0$, the harmonic mean naturally applies to
disconnected or weakly connected networks. It appears that the notion of harmonic
centrality was first proposed by Marchiori and Latora, 2000, even though
several works independently proposed the same notion or its variations and generalizations
(Dekker, 2005; Cohen and Kaplan, 2007; Rochat, 2009; Pan and Saramäki, 2011).


Figure 3.1. Three distance based centrality indices (dark blue means high centrality). Panels: (a) Node degree; (b) Closeness; (c) Harmonic.
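
The two distance-based indices above can be computed with a breadth-first search. The sketch below is a minimal illustration for unweighted, undirected graphs stored as adjacency lists (so that $d(v, u) = d(u, v)$); the function names and the two small example graphs are assumptions made for this illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from source; unreachable nodes are absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness(adj, u):
    """(n - 1) / sum_v d(v, u), for a connected undirected graph."""
    dist = bfs_distances(adj, u)
    return (len(adj) - 1) / sum(dist.values())

def harmonic(adj, u):
    """(1 / (n - 1)) * sum_{v != u} 1 / d(v, u), with 1 / infinity = 0."""
    dist = bfs_distances(adj, u)
    return sum(1 / d for v, d in dist.items() if v != u) / (len(adj) - 1)

# Hypothetical examples: the path 0 - 1 - 2, and the same path plus an isolated node 3
path = {0: [1], 1: [0, 2], 2: [1]}
with_isolated = {0: [1], 1: [0, 2], 2: [1], 3: []}
```

Unreachable nodes simply contribute nothing to the harmonic sum, which is how the convention on infinite distances is implemented here; `closeness` is only meaningful on the connected example.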


Comparison. We calculate the three centrality indices described above on the same
graph (see Figure 3.1). As expected, the node degree centrality is large only for high
degree nodes, which in the chosen graph are located in the bottom left. On the
contrary, closeness centrality gives importance to nodes at the junction of the two
clusters. The harmonic centrality appears to mix both, since nodes with large degrees
as well as nodes located at the junction have large harmonic centrality values.


3.1.2 Spectral Centrality Indices 


Spectral centrality indices are the indices that can be obtained as a solution of some
eigenvalue problem

$$x M = \lambda x, \tag{3.2}$$

where $x$ is a row vector. The reason for operating with row vectors will become clear from the
analysis that follows.


Adjacency spectral centrality. This is one of the oldest centrality indices, whose
application to scoring chess tournaments goes back to the end of the 19th century
(Landau, 1895). As the matrix $M$, we choose the graph adjacency matrix $A$ and
take as centrality indices the elements of the eigenvector associated with the largest
positive eigenvalue. We note that such an eigenvector is also called the Perron-Frobenius
eigenvector in the theory of non-negative matrices.


Random walk centrality or Seeley’s index. Seeley, 1949 proposed to normalise
the rows of the adjacency matrix by their sums. This implies that the reputation of a
node is divided among the successors of that node. Thus, if we denote $P = D^{-1} A$,
where $D$ is the diagonal matrix of node degrees, the random walk centrality is given
as a solution of the following eigenvalue problem:

$$\sigma = \sigma P. \tag{3.3}$$

There are two probabilistic interpretations of the elements of $\sigma$. The first interpretation
says that $\sigma_i$ is the long-term fraction of time a random walker on the graph spends
at node $i$. Also, from the theory of Markov chains, we know that $E_i[T_i] = 1/\sigma_i$,
where $T_i$ is the return time to node $i$. Thus, by the second interpretation, the reciprocal
of $\sigma_i$ gives the expected return time to node $i$. Two further remarks are
in order. Firstly, if the graph is undirected, the reversibility of the random walk
implies that Seeley’s index is proportional to the node degree. Secondly,
the original Seeley’s index was defined only for strongly connected graphs.
If a graph is not strongly connected, one can apply various regularizations. One
regularization will be described in the next paragraph.
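
The first remark is easy to check by hand on a small example. The sketch below uses exact rational arithmetic to verify that, for an undirected graph, the vector with entries proportional to the node degrees solves the eigenvalue problem $\sigma = \sigma P$ with $P = D^{-1} A$. The adjacency matrix is a hypothetical example chosen for illustration.

```python
from fractions import Fraction

# Hypothetical undirected graph given by its adjacency matrix
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(A)
deg = [sum(row) for row in A]
# P = D^{-1} A: each row of A is divided by its sum
P = [[Fraction(A[i][j], deg[i]) for j in range(n)] for i in range(n)]
# Candidate solution of sigma = sigma P: sigma_i proportional to the degree
sigma = [Fraction(d, sum(deg)) for d in deg]
sigma_P = [sum(sigma[i] * P[i][j] for i in range(n)) for j in range(n)]
return_times = [1 / s for s in sigma]   # E_i[T_i] = 1 / sigma_i
```

In this example the degree-one node has the longest expected return time, in line with the second interpretation of $\sigma$.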


PageRank. The creators of Google, Brin and Page, 1998, proposed the PageRank
centrality index to rank web pages. PageRank models the behaviour of a web surfer by
allowing a random walker to follow an out-going link with probability $c$ and to
restart from a uniformly random web page with the complementary probability
$1 - c$. Thus, PageRank is the stationary distribution of the random walker, and
hence it is a solution of the following system:

$$\pi = c \pi P + (1 - c) v, \tag{3.4}$$

where $P = D^{-1} A$ and $v$ is the uniform distribution. In fact, instead of the uniform
distribution one can choose a distribution concentrated on some particular set of
nodes. This results in Personalized PageRank, which allows one to measure centrality
with respect to a certain group of nodes. Then, $v$ is referred to as the personalization
distribution.

Note that using the normalisation condition $\pi \mathbf{1} = 1$, we can rewrite (3.4) as
follows:

$$\pi = \pi \left(c P + (1 - c) \mathbf{1} v\right),$$

which explains why PageRank belongs to the family of spectral indices.


We can also rewrite equation (3.4) in the following way:

$$\pi [I - cP] = (1 - c) v,$$

which gives a useful explicit matrix expression for PageRank:

$$\pi = (1 - c)\, v\, [I - cP]^{-1}. \tag{3.5}$$
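
In practice the inverse in (3.5) is rarely formed explicitly. A common alternative, sketched below, is to iterate equation (3.4) until the vector stops changing; this is a minimal illustration rather than a production implementation. The function name, the default value $c = 0.85$, the tolerance and the small directed graph are assumptions, and every node is assumed to have at least one out-going link.

```python
def pagerank(A, c=0.85, v=None, tol=1e-12, max_iter=1000):
    """Iterate pi <- c * pi * P + (1 - c) * v with P = D^{-1} A (row vectors)."""
    n = len(A)
    v = v or [1.0 / n] * n                     # uniform restart by default
    out = [sum(row) for row in A]              # out-degrees, assumed positive
    pi = list(v)
    for _ in range(max_iter):
        new = [(1 - c) * v[j] + c * sum(pi[i] * A[i][j] / out[i] for i in range(n))
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical directed graph in which every node has an out-going link
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
pi = pagerank(A)
```

Passing a vector `v` concentrated on a few nodes gives the Personalized PageRank described above.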


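In particular, the above expression allows us to extend Seeley’s index to non strongly
connected networks. Consider first an intermediate situation when the network
consists of $m$ strongly connected components and each component is described by
its own transition matrix $P^{(i)}$, $i = 1, \dots, m$. Then, using formula (3.5) we can
write

$$\pi = (1 - c) \left[ \tfrac{n_1}{n} \tfrac{1}{n_1} \mathbf{1}^T \; \cdots \; \tfrac{n_m}{n} \tfrac{1}{n_m} \mathbf{1}^T \right]
\begin{bmatrix} [I - cP^{(1)}]^{-1} & & \\ & \ddots & \\ & & [I - cP^{(m)}]^{-1} \end{bmatrix}
= \left[ \tfrac{n_1}{n} \pi^{(1)} \; \cdots \; \tfrac{n_m}{n} \pi^{(m)} \right],$$

where $n_i$ is the number of nodes in component $i$, the vectors $\mathbf{1}$ are of appropriate dimensions and

$$\pi^{(i)} = (1 - c) \frac{1}{n_i} \mathbf{1}^T [I - cP^{(i)}]^{-1}$$

is PageRank of component $i$. Now we recall from the theory of Markov chains (see,
e.g., Avrachenkov et al., 2013a; Puterman, 2014) that the following asymptotic
expansion takes place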


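$$[I - cP]^{-1} = \frac{1}{1 - c}\, \Pi + D + O(1 - c) \quad \text{as } c \uparrow 1, \tag{3.6}$$

where $\Pi$ is the ergodic projection and $D$ is the deviation matrix, the quantities
given by

$$\Pi = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} P^s, \qquad D = [I - P + \Pi]^{-1} - \Pi.$$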
Now if a component is strongly connected, we have

$$\Pi^{(i)} = \mathbf{1} \sigma^{(i)},$$

where $\sigma^{(i)}$ is the stationary distribution of the random walker on component $i$,
that is,

$$\sigma^{(i)} = \sigma^{(i)} P^{(i)}, \qquad \sigma^{(i)} \mathbf{1} = 1.$$

Thus, it follows from (3.6) that

$$\pi^{(i)}(c) \to \sigma^{(i)} \quad \text{as } c \uparrow 1,$$

and a natural generalization of Seeley’s index to the case of several strongly connected
components is

$$\sigma = \left[ \tfrac{n_1}{n} \sigma^{(1)} \; \cdots \; \tfrac{n_m}{n} \sigma^{(m)} \right], \tag{3.7}$$

where $\sigma^{(i)}$ is the stationary distribution or Seeley’s index on component $i$, of size $n_i$. We see
that in such a generalization the relative importance of a component is proportional
to its size, which appears to be quite fair. In particular, this generalization means
that it is better to have a large “local” centrality in a large component.

The case of weakly connected components is treated in Avrachenkov et al.,
2008b.


PageRank with node-dependent restart probability. One natural generalization
of PageRank is based on a random walk whose restart probabilities depend on the
current node. Specifically, let the random walk at node $i \in V$ follow an out-going
link with probability $c(i)$ and restart from the distribution $v$ with the complementary
probability $1 - c(i)$. For convenience, denote by $C$ the diagonal
matrix with $c(i)$ placed on its diagonal in the appropriate position. Then, the random
walk with node-dependent restart can be described by the following transition
probability matrix:

$$\tilde{P} = C D^{-1} A + (I - C) \mathbf{1} v. \tag{3.8}$$


Avrachenkov et al., 2014a proposed two generalizations of the Personalized
PageRank with node-dependent restart:

(i) The Occupation-Time Personalized PageRank (OT-PPR) is given by

$$\pi_j(v) = \lim_{t \to \infty} \mathbb{P}_v[X_t = j]. \tag{3.9}$$

By the fact that $\pi(v)$ is the stationary distribution of the Markov chain, we
can interpret $\pi_j(v)$ as a long-run frequency of visits to node $j$, i.e.,

$$\pi_j(v) = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} 1\{X_s = j\}.$$

(ii) The Location-of-Restart Personalized PageRank (LR-PPR) is given by

$$\rho_j(v) = \lim_{t \to \infty} \mathbb{P}_v[X_t = j \text{ just before restart}] = \lim_{t \to \infty} \mathbb{P}_v[X_t = j \mid \text{restart at time } t + 1]. \tag{3.10}$$

We can interpret $\rho_j(v)$ as a long-run frequency of visits to node $j$ which are
followed immediately by a restart, i.e.,

$$\rho_j(v) = \lim_{t \to \infty} \frac{1}{N_t} \sum_{s=1}^{t} 1\{X_s = j,\ X_{s+1} \text{ restarts}\},$$

where $N_t$ denotes the number of restarts up to time $t$.


In Avrachenkov et al., 2014a the following explicit matrix formulae were given
for the Occupation-Time Personalized PageRank

$$\pi(v) = \frac{1}{v [I - CP]^{-1} \mathbf{1}}\, v [I - CP]^{-1}, \tag{3.11}$$

with $P = D^{-1} A$, and for the Location-of-Restart Personalized PageRank

$$\rho(v) = v [I - CP]^{-1} [I - C]. \tag{3.12}$$

We see that the formula (3.11) is indeed a generalization of (3.5).

Denote for brevity $\pi_j(i) = \pi_j(e_i)$, where $e_i$ is the $i$-th vector of the standard
basis, so that $\pi_j(i)$ denotes the importance of node $j$ from the perspective of $i$.
Similarly, $\pi_i(j)$ denotes the importance of node $i$ from the perspective of $j$. There
is a very useful relation between these “direct” and “reverse” OT-PPRs in the case
of undirected graphs.


Theorem 3.1 (Avrachenkov et al., 2014a). When $A^T = A$ and $C > 0$, the following
relation holds:

$$\frac{d_i}{c_i K_i(C)}\, \pi_j(i) = \frac{d_j}{c_j K_j(C)}\, \pi_i(j), \tag{3.13}$$

with

$$K_i(C) = \frac{1}{e_i^T [I - CP]^{-1} \mathbf{1}}. \tag{3.14}$$

Note that $K_i(C)$ can be interpreted as the reciprocal of the expected time between
two consecutive restarts if the restart distribution is concentrated on node $i$, i.e.,

$$K_i(C)^{-1} = E_i[\#\text{ steps before restart}]. \tag{3.15}$$

Thus, given that $[I - CP]^{-1}$ is the fundamental matrix of the absorbing Markov
chain, the expression (3.11) admits one more probabilistic interpretation of the
OT-PPR in the form of a renewal equation:

$$\pi_j(v) = \frac{E_v[\#\text{ visits to } j \text{ before restart}]}{E_v[\#\text{ steps before restart}]}.$$

In particular, if $c_i = c$, $\forall i$ (the case of standard PageRank), we obtain the following
simple relation between “direct” and “reverse” PPRs:


Corollary 3.1. When $A^T = A$ and $c_i = c$, $\forall i$, the relation (3.13) reduces to

$$d_i \pi_j(i) = d_j \pi_i(j). \tag{3.16}$$
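
The symmetry in Corollary 3.1 can be observed numerically. The sketch below computes the Personalized PageRank vectors $\pi(e_i)$ of a small undirected graph by a fixed number of iterations of the PageRank recursion and compares $d_i \pi_j(i)$ with $d_j \pi_i(j)$. The function name, the number of iterations, the value $c = 0.85$ and the example graph are assumptions made for this illustration.

```python
def personalized_pagerank(A, c, seed_node, iters=500):
    """pi(e_i): iterate pi <- c * pi * P + (1 - c) * e_i with P = D^{-1} A."""
    n = len(A)
    deg = [sum(row) for row in A]
    pi = [1.0 if k == seed_node else 0.0 for k in range(n)]
    for _ in range(iters):
        pi = [(1 - c) * (1.0 if j == seed_node else 0.0)
              + c * sum(pi[k] * A[k][j] / deg[k] for k in range(n))
              for j in range(n)]
    return pi

# Hypothetical undirected graph (A is symmetric)
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
deg = [sum(row) for row in A]
ppr = [personalized_pagerank(A, 0.85, i) for i in range(len(A))]  # ppr[i][j] = pi_j(i)
```

A practical consequence is that the importance of a fixed node from the perspective of every other node can be read off a single Personalized PageRank vector.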


Katz’s index. Katz, 1953 proposed a centrality index which is a clear predecessor
of PageRank. It is given by the formula

$$\kappa = \mathbf{1}^T \sum_{t=1}^{\infty} \beta^t A^t = \mathbf{1}^T \left([I - \beta A]^{-1} - I\right). \tag{3.17}$$

Note that the subtraction of the identity is not really needed and one often refers
to the following quantity as Katz index:

$$\kappa = \mathbf{1}^T \sum_{t=0}^{\infty} \beta^t A^t = \mathbf{1}^T [I - \beta A]^{-1}. \tag{3.18}$$

For both versions to be well defined, the discounting parameter $\beta$
must be strictly smaller than the reciprocal of the Perron-Frobenius eigenvalue, $\lambda_1^{-1}(A)$.

The main difference with PageRank is that Katz centrality gives “full endorsement”
to each neighbour node pointed to by an out-going link. It was observed that
this could be appropriate in some social networks where a reference from one social
network member to another member carries a lot of importance.

Vigna, 2016 noticed that, using the theorem of Brauer, 1952 on the displacement
of eigenvalues, the Katz index (3.18) can be expressed as a solution of the eigenvalue
problem

$$\lambda_1(A)\, \kappa = \kappa \left(\beta \lambda_1(A) A + (1 - \beta \lambda_1(A))\, r \mathbf{1}^T\right),$$

where $r$ is the right dominant eigenvector of $A$ normalised such that $\mathbf{1}^T r = \lambda_1(A)$. This justifies
the classification of the Katz index as a spectral centrality index.
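
The series form of the Katz index suggests a simple iterative computation. The sketch below assumes the version of the index that includes the $t = 0$ term and accumulates the series through the recursion $\kappa \leftarrow \beta \kappa A + \mathbf{1}^T$, which converges when $\beta$ is below the reciprocal of the largest eigenvalue of $A$. The function name, the number of iterations and the triangle example are assumptions made for illustration.

```python
def katz(A, beta, iters=200):
    """kappa = 1^T sum_{t>=0} beta^t A^t, via kappa <- beta * kappa * A + 1^T."""
    n = len(A)
    kappa = [1.0] * n
    for _ in range(iters):
        kappa = [1.0 + beta * sum(kappa[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return kappa

# Hypothetical undirected triangle: its largest eigenvalue is 2, so beta must be below 0.5
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
kappa = katz(A, beta=0.25)
```

By symmetry all three nodes of the triangle receive the same value, which solves $\kappa = 1 + 2 \beta \kappa$, that is $\kappa = 1 / (1 - 2\beta)$.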


HITS centrality index. HITS, introduced by Kleinberg, 1999, actually provides
two centrality indices. The first centrality index ranks nodes as authorities and the
second centrality index ranks nodes as hubs. Kleinberg, 1999 suggests that a good
“authoritative” node is pointed to by many good “hubs”, and in turn, a good “hub”
points to good “authoritative” nodes. This verbal statement can be represented by
the following iterative process:

$$h^{(t+1)} = a^{(t)} A^T, \qquad a^{(t+1)} = h^{(t+1)} A,$$

where $A$ is the adjacency matrix. In the limit, and up to normalisation, the authority index is a solution of

$$a = a A^T A.$$

Hence, the index $a$ is the left dominant eigenvector of $A^T A$. Equivalently, $a^T$ is
the right singular vector of $A$ associated with its largest singular value.
Similarly, $h^T$ is the left singular vector of $A$ associated with the largest
singular value. Note that the above definition is only valid for strongly
connected graphs.
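
The iterative process can be run directly. The sketch below alternates the hub and authority updates with row vectors and rescales both vectors to unit Euclidean norm at each step; the normalisation choice, the function name and the small directed graph are assumptions made for illustration, and the example is deliberately tiny rather than strongly connected.

```python
import math

def hits(A, iters=200):
    """Alternate h <- a A^T and a <- h A, normalising to unit Euclidean norm."""
    n = len(A)
    a = [1.0] * n
    for _ in range(iters):
        h = [sum(a[j] * A[i][j] for j in range(n)) for i in range(n)]   # h = a A^T
        a = [sum(h[i] * A[i][j] for i in range(n)) for j in range(n)]   # a = h A
        norm_h = math.sqrt(sum(x * x for x in h))
        norm_a = math.sqrt(sum(x * x for x in a))
        h = [x / norm_h for x in h]
        a = [x / norm_a for x in a]
    return a, h

# Hypothetical directed graph: node 0 points to nodes 2 and 3, node 1 points to node 2
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(A)
```

In this example node 2 is the best authority because both hubs point to it, and node 0 is the best hub because it points to both authorities.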


Comparison. Figure 3.2 presents four spectral centrality indices on the same graph.
We observe that the adjacency spectral centrality heavily weights the nodes in the
bottom left of the graph, which appears to be a region with several large degree
nodes. The random walk centrality gives more importance to other nodes as well,
with an emphasis on larger degree nodes and nodes in the bottom left, while the
PageRank index diminishes even further the importance of nodes in the bottom left.
In fact, as expected for undirected graphs, due to time-reversibility, the random
walk centrality gives the same ranking as the node degree (compare Figure 3.2b
with Figure 3.1a). Finally, the Katz centrality index gives importance to only a few
nodes located in the middle of the left component.


3.1.3 Hitting Time Based Centrality Indices


It is often the case that not only the shortest paths but also longer paths matter
in the analysis of social networks. A typical example of such a case is rumour or
information propagation in social networks. In fact, this phenomenon was already
reflected in the definition of PageRank and Katz centrality indices where all the
paths are taken into account but longer paths are discounted. One more measure
of “proximity” in networks is given by mean first passage times (or mean hitting
times) of a random walker. The mean first passage time from node $i$ to node $j$ is
given by (see e.g., Aldous and Fill, 2002; Meyer, 2000):

$$E_i[T_j] = e_i^T [I - P_{-j}]^{-1} \mathbf{1}, \tag{3.19}$$

where $e_i$ is the $i$-th vector of the standard basis and $P_{-j}$ is the taboo transition
probability matrix obtained from $P$ by deleting its $j$-th row and $j$-th column.


Figure 3.2. Spectral centrality indices.

Now, in analogy with closeness centrality, see (3.1), we can define hitting time
centrality as

$$h_j = \frac{n}{\sum_{i=1}^{n} E_i[T_j]}. \tag{3.20}$$

To the best of our knowledge, the expression (3.20) was proposed by White and
Smyth, 2003. Note that in general $E_i[T_j] \neq E_j[T_i]$. Thus, one can also use the
following alternative definition for hitting time centrality:

$$\tilde{h}_i = \frac{n}{\sum_{j=1}^{n} E_i[T_j]}. \tag{3.21}$$

It is known (see Chandra et al., 1996; Aldous and Fill, 2002; Ellens et al., 2011)
that there is a connection between the effective resistance $r_{ij}$ in a graph and hitting
times:

$$r_{ij} = \frac{1}{2m} \left(E_i[T_j] + E_j[T_i]\right),$$

where $m$ is the number (total weight) of edges. Thus, a natural, symmetric version
of hitting time centrality is given by

$$\bar{h}_i = \frac{n}{\sum_{j=1}^{n} r_{ij}} = \frac{2mn}{\sum_{j=1}^{n} \left(E_i[T_j] + E_j[T_i]\right)}. \tag{3.22}$$

One additional benefit of using effective resistances is that they actually define a
metric on a graph.


Figure 3.3. Hitting time centrality indices. Panels: (a) index (3.20); (b) index (3.21).
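
The relation between hitting times and effective resistances can be checked on a graph where the resistances are known. The sketch below solves the linear system behind the mean first passage times exactly, by deleting the target node from the transition matrix, and then forms the effective resistance from the two hitting times. The helper names and the choice of a three-node path with unit edge weights are assumptions made for illustration.

```python
from fractions import Fraction

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(M)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def hitting_times_to(A, j):
    """E_i[T_j] for all i != j, obtained by solving [I - P_{-j}] x = 1."""
    n = len(A)
    deg = [sum(row) for row in A]
    keep = [i for i in range(n) if i != j]
    M = [[Fraction(int(i == k)) - Fraction(A[i][k], deg[i]) for k in keep] for i in keep]
    x = solve(M, [Fraction(1)] * len(keep))
    return dict(zip(keep, x))

# Hypothetical example: the path 0 - 1 - 2, with m = 2 edges
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
m = sum(map(sum, A)) // 2
H = {j: hitting_times_to(A, j) for j in range(len(A))}   # H[j][i] = E_i[T_j]
r_02 = (H[2][0] + H[0][2]) / (2 * m)                      # effective resistance
```

For this path the two end nodes are separated by two unit resistors in series, so the effective resistance between them should equal 2, which is what the hitting times give.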


Comparison. Figure 3.3 presents different hitting time centralities on the same 
graph. We observe that the centrality defined by (3.20) results in large weights on 
nodes in the bottom left and medium weights on nodes in the right component. 
On the contrary, the centrality defined by (3.21) gives small weights to nodes in 
the left component and large weights to nodes in the right component. Finally, the 
centrality defined by (3.22) results in large weights in nodes that are well connected, 
and small weights to more isolated nodes. 


Extension to disconnected graphs. The three versions of hitting time centrality
(3.20), (3.21) and (3.22) are well-defined only for connected graphs. There are
at least two approaches for extending this notion of centrality to disconnected or
non strongly connected graphs. Firstly, one can just use the harmonic mean, as was
done for the standard closeness centrality. Secondly, as was suggested
in Hopcroft and Sheldon, 2008 and Avrachenkov et al., 2018d, one can use the
random walk with restart. Similarly to PageRank, let us consider the random walk
with restart parameter $c$. Then, the expected hitting time with restart from node $i$
to node $j$ is given by


ef U = era). 
Do a 


E;lT;] 


The numerator of the above expression provides a more significant contribution 
than the denominator, especially when the parameter c is close to one. Thus, we 
suggest using the following quantity as a hitting time with restart centrality:


$$h_j^{(c)} = \frac{n}{\sum_{i=1}^{n} e_i^T [I - cP_{-j}]^{-1} \mathbf{1}}. \qquad (3.23)$$


Note that even when the network is strongly connected, the matrix $[I - P_{-j}]$
is often ill-conditioned, and the introduction of the factor $c$ helps to improve the
condition number of the problem.


3.1.4 Betweenness Centrality Indices


A node in a social network can be viewed as important if that node contributes 
significantly to information flows or appears as a facilitator of communications. 


Shortest path betweenness centrality. Freeman, 1977 introduced the betweenness
centrality index based on shortest paths. Let $\sigma_{st}$ be the number of shortest paths
going from node $s$ to node $t$, and let $\sigma_{st}(v)$ be the number of such shortest paths
that pass through node $v$. Then, the shortest path betweenness centrality of node $v$
is defined as follows:

$$\frac{1}{(n-1)(n-2)} \sum_{s,t:\, s,t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}.$$
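To make the definition concrete, here is a minimal sketch that evaluates the shortest path betweenness directly from the formula. It assumes a connected, unweighted, undirected graph stored as a dictionary of neighbour lists, and it sums over ordered pairs $(s, t)$ with $s \neq t$; the function names are ours. The sketch uses the fact that $v$ lies on a shortest $s$-$t$ path exactly when $d(s,v) + d(v,t) = d(s,t)$, in which case $\sigma_{st}(v) = \sigma_{sv}\sigma_{vt}$. Its cost is cubic in $n$, so it is meant for illustration only.

```python
from collections import deque


def bfs_counts(adj_list, s):
    """Return (dist, sigma): BFS distances from s and numbers of shortest paths from s."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj_list[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj_list):
    """Shortest path betweenness of every node, normalised by (n - 1)(n - 2)."""
    nodes = list(adj_list)
    n = len(nodes)
    info = {s: bfs_counts(adj_list, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s:
                continue
            for v in nodes:
                if v in (s, t):
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path iff d(s, v) + d(v, t) = d(s, t)
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return {v: score[v] / ((n - 1) * (n - 2)) for v in nodes}
```

On a path with three nodes, the middle node obtains the maximal value 1 and the two end nodes obtain 0.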


As was already mentioned, the information in social networks does not flow neces- 
sarily via shortest paths. Therefore, several researchers have extended the between- 
ness centrality to take into account longer paths. 


Network flow betweenness centrality. In Freeman et al., 1991, the authors suggest
using the concept of max-flow. This concept also makes it possible to deal with weighted
networks. Let $w_{ij}$ be the weight of the link between nodes $i$ and $j$ (if there is no link,
the weight is zero). Then, a flow from node $s$ to node $t$ is a mapping $f$ on the set of
links such that the following two constraints are satisfied:

1. Capacity constraint: $\forall (i, j) \in E$, $f_{ij} \le w_{ij}$;

2. Conservation of flow: $\forall v$ such that $v \neq s, t$:

$$\sum_{u:\,(u,v)\in E} f_{uv} = \sum_{w:\,(v,w)\in E} f_{vw}.$$


Then, the value of the flow f is given by 


$$|f| = \sum_{v:\,(s,v)\in E} f_{sv},$$

and the max-flow is the maximum flow which can go from $s$ to $t$. Its value can be
found by linear programming, and the celebrated max-flow min-cut theorem says
that the maximum flow is equal to the minimum capacity over all $s$-$t$ cuts.
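As an illustration, the max-flow value can also be computed with augmenting paths instead of linear programming. The sketch below is a plain Edmonds-Karp procedure, which is our choice for the example and not a method discussed in the text; it assumes that the link weights are given as a capacity matrix, and the function name `max_flow` is hypothetical.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Edmonds-Karp: return (value, flow) for a capacity matrix and nodes s, t."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in range(n):
                if v not in parent and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, flow           # no augmenting path is left
        # Recover the path, find its bottleneck and push the flow along it.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        delta = min(capacity[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += delta
            flow[v][u] -= delta          # skew symmetry encodes the reverse arcs
        value += delta
```

The returned matrix stores net flows, so that the capacity constraint reads `flow[u][v] <= capacity[u][v]` and the conservation of flow at an intermediate node `v` reads `sum(flow[v]) == 0`.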

Now, the network flow betweenness centrality of node $v$ by Freeman et al., 1991
is defined as follows:

$$\frac{\sum_{s,t:\, s,t \neq v} m_{st}(v)}{\sum_{s,t:\, s,t \neq v} m_{st}}, \qquad (3.24)$$

where $m_{st}$ is the value of the max-flow from $s$ to $t$ and $m_{st}(v)$ is the part of this flow that
passes through node $v$.


Current flow betweenness centralities. One more variant of betweenness cen- 
trality, which is based not only on shortest paths and uses the theory of electri- 
cal networks, was proposed by Brandes and Fleischer, 2005 and Newman, 2005a. 
Consider a weighted graph as an electrical network with conductances given by 
link weights. Suppose that a unit of current enters at node s (source) and leaves the 
network at node $t$ (sink). Then, using Kirchhoff’s current law and Ohm’s law, we
obtain the following linear system for the vector of potentials $\varphi$:

$$L\varphi = b, \qquad b_v = \begin{cases} 1, & v = s, \\ -1, & v = t, \\ 0, & \text{otherwise}, \end{cases} \qquad (3.25)$$


where $L = D - A$ is the graph Laplacian. Since $L\mathbf{1} = 0$, the vector of potentials
is determined up to an additive constant. Thus, without loss of generality, we can
assume that the potential of the sink node is zero (this node is grounded). Then, 
the other potential values can be uniquely determined by (3.25). The throughput 
of node v is defined by 


$$\tau_{st}(v) = \frac{1}{2}\Big( -|b_v| + \sum_{w:\,(v,w)\in E} w_{vw}\, |\varphi_v - \varphi_w| \Big),$$


and the current flow betweenness centrality is given by 


$$\frac{1}{(n-1)(n-2)} \sum_{s,t} \tau_{st}(v). \qquad (3.26)$$
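The computation behind the current flow betweenness can be sketched as follows. This is only an illustration under our own assumptions: the graph is connected and undirected, it is given by an adjacency matrix with integer weights (conductances), the sum runs over ordered pairs $(s, t)$ with $s \neq t$, and the helper names are hypothetical. For each pair, the sink is grounded, the remaining potentials are obtained from the reduced Laplacian system, and the throughput is accumulated.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def current_flow_betweenness(adj):
    """Current flow betweenness of every node, normalised by (n - 1)(n - 2)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    total = [Fraction(0)] * n
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            keep = [v for v in range(n) if v != t]       # sink grounded: phi_t = 0
            lap = [[(deg[u] if u == v else 0) - adj[u][v] for v in keep]
                   for u in keep]
            b = [1 if u == s else 0 for u in keep]       # unit current enters at s
            phi = [Fraction(0)] * n
            for u, value in zip(keep, solve(lap, b)):
                phi[u] = value
            for v in range(n):
                abs_b = 1 if v in (s, t) else 0          # |b_v|
                current = sum(adj[v][w] * abs(phi[v] - phi[w]) for w in range(n))
                total[v] += Fraction(1, 2) * (-abs_b + current)
    return [x / ((n - 1) * (n - 2)) for x in total]
```

Solving one linear system per pair of nodes is wasteful; the sketch favours readability over efficiency.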


Note that both the network flow betweenness (3.24) and the current flow betweenness
(3.26) are well-defined only for strongly connected networks. In fact, the current
flow betweenness is defined only for undirected networks. Furthermore, the
system (3.25) is often ill-conditioned. To improve the conditioning of the system 
and to allow the application to not strongly connected networks, we can consider 
at least the following two regularizations. Firstly, as in the case of PageRank, we can
regularize the system (3.25) as follows (Avrachenkov et al., 2013b):

$$[D - \alpha A]\varphi = b.$$

This modification has interpretations in terms of electrical networks and random
walks on graphs. In particular, it means that we multiply all the
conductances by the factor $\alpha$ and ground each node $v$ with the conductance $(1-\alpha)d_v$.
The other equations of the current flow centrality stay unchanged. The second
regularization consists in adding a term to the Laplacian (Avrachenkov et al., 2015):

$$[D - A + \beta I]\varphi = b.$$


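This has the interpretation that we ground all the nodes with conductance $\beta$, independent
of node degree. Then, the above system has the following solution:

$$\varphi = [I - \tilde{D}P]^{-1} \tilde{D} D^{-1} b,$$

with

$$\tilde{D} = \mathrm{diag}\Big( \frac{d_v}{d_v + \beta} \Big).$$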


Thus, the second regularization can be interpreted in terms of a random walk with 
non-uniform restart (see the paragraph about PageRank with non-uniform restart 
probabilities). Namely, the random walker restarts less frequently from high-degree 
nodes. As was observed in Avrachenkov et al., 2013b and Avrachenkov et al., 2015,
both regularizations give similar rankings as the original current flow centrality but 
are much easier to calculate and to approximate. An advantage of the first regulariza- 
tion could be that the bias induced by high degree nodes is suppressed, whereas an 
advantage of the second regularization is in the fact that all the nodes are grounded 
in the same way and thus we need to perform averaging only over the source nodes. 
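The random walk interpretation of both regularizations follows from a short manipulation of the two regularized systems. The derivation below is a sketch that uses only $P = D^{-1}A$; the shorthand $D_\beta = D + \beta I$ is introduced here for convenience and is not notation from the cited works.

```latex
% Shorthand (ours): P = D^{-1} A and D_\beta = D + \beta I.
\begin{align*}
[D - \alpha A]\,\varphi = b
  &\;\Longleftrightarrow\; D\,[I - \alpha P]\,\varphi = b
   \;\Longleftrightarrow\; \varphi = [I - \alpha P]^{-1} D^{-1} b, \\
[D - A + \beta I]\,\varphi = b
  &\;\Longleftrightarrow\; D_\beta\,[I - D_\beta^{-1} D P]\,\varphi = b
   \;\Longleftrightarrow\; \varphi = [I - D_\beta^{-1} D P]^{-1} D_\beta^{-1} b.
\end{align*}
% D_\beta^{-1} D = \mathrm{diag}\big( d_v / (d_v + \beta) \big)
```

In the first case the walk continues with the same probability $\alpha$ at every node, as in PageRank. In the second case the continuation probability at node $v$ is $d_v/(d_v + \beta)$, which is closer to one for high-degree nodes.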

We would like to mention that it is also very natural in the context of between- 
ness centralities to define centrality indices for edges. Specifically, for the shortest 
path edge betweenness centrality we count the number of shortest paths passing 
through an edge; and for the flow based edge betweenness centralities we calculate 
the flow passing through an edge under consideration. As we shall discuss later, the 
edge betweenness centralities are very useful for graph clustering. 


Comparison. Figure 3.4 presents the different betweenness centralities calculated
on the same graph. We observe that the shortest path betweenness centrality gives
importance solely to nodes at the junction of the left and right clusters. Indeed,
those nodes are of paramount importance for shortest paths since they are essential for
joining a node in the left cluster to a node in the right one. Network flow gives
more importance to nodes that are well connected and less importance to nodes
that are more isolated. Finally, current flow accentuates this effect even further:
the nodes with low current flow are those at the extremities (top right or bottom right), with
fewer connections to the rest of the graph.

Figure 3.4. Betweenness centrality indices: (a) shortest path betweenness; (b) network flow; (c) current flow.


3.1.5 Game Theory Based Centrality Indices 


One more way to define a network centrality is based on cooperative game theory. 
This is actually quite a natural way to define network centrality since cooperative 
game theory provides means to estimate the importance of a node based on the 
node’s contribution to network connectivity or network cohesiveness. 

Let us recall that the basic quantity of cooperative game theory is a characteristic
function $v(\cdot)$, which is defined on the subsets of nodes and satisfies the property
$v(\emptyset) = 0$. Myerson, 1977 extended the concept of the Shapley value (Shapley, 1953)
to the graph setting. The Myerson-Shapley value is the unique allocation $Y_i(v, G)$ satisfying
the following two axioms:

1. if $S$ is a connected component of graph $G$, then the members of the coalition
$S$ ought to allocate to themselves the total value $v(S)$ available to them, i.e.,

$$\sum_{i \in S} Y_i(v, G) = v(S);$$

2. $\forall G$, $\forall i, j \in G$, both nodes $i$ and $j$ obtain equal payoffs after adding or deleting
a link $(i, j)$, i.e.,

$$Y_i(v, G) - Y_i(v, G - ij) = Y_j(v, G) - Y_j(v, G - ij).$$


The Myerson allocation can be computed by the Shapley formula 


$$Y_i(v, G) = \sum_{S \subseteq V \setminus \{i\}} \frac{s!\,(n-s-1)!}{n!} \big( v_G(S \cup \{i\}) - v_G(S) \big),$$

where $s = |S|$ and $v_G(\cdot)$ is the characteristic function defined additively with
respect to connected components. However, in general, the computation by the
above formula is very cumbersome. It appears that there is one natural choice for
the characteristic function, which simplifies the computation of the Myerson value.
Inspired by the path-discounting characteristic functions of Jackson and Wolinsky,
1996; Jackson, 2010, the following characteristic function was proposed in Mazalov
and Trukhina, 2014; Mazalov et al., 2016 for trees and in Avrachenkov et al., 2018a
for general graphs. Let $\delta \in [0, 1]$ be a discount factor. Each link (or direct
connection) gives to coalition $S$ the value $\delta$. Moreover, players obtain a value from
indirect connections. Namely, each simple path of length 2 belonging to coalition
$S$ gives to this coalition the value $\delta^2$, a simple path of length 3 gives to the coalition
the value $\delta^3$, etc. Thus, the coalition value can be expressed by the following
formula


$$v(S) = a_1(G, S)\,\delta + a_2(G, S)\,\delta^2 + \cdots,$$

where $a_k(G, S)$ is the number of simple paths of length $k$ in coalition $S$. Recall that
a simple path is a path with no repeated nodes. The use of simple paths is crucial.
In Avrachenkov et al., 2018a it was shown that this characteristic function leads to
a manageable expression for the Myerson value:

$$Y_i(v, G) = \frac{a_1^{(i)}(G)}{2}\,\delta + \frac{a_2^{(i)}(G)}{3}\,\delta^2 + \cdots,$$

where $a_k^{(i)}(G)$ is the number of simple paths of length $k$ containing node $i$. The
quantity $Y_i(v, G)$ as a centrality index combines some features of betweenness centrality
with the path discounting as in PageRank and Katz centralities.
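The agreement between the path-counting expression and the general Shapley formula can be checked by brute force on a small graph. The following sketch assumes an undirected graph stored as a dictionary of neighbour lists with comparable node labels; all function names are ours. Simple paths are enumerated exhaustively, so the code is only suitable for toy examples. Intuitively, a simple path of length $k$ contributes $\delta^k$ to a coalition only when all of its $k+1$ nodes are present, so its value is shared equally among them.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def simple_paths(adj_list, nodes):
    """Yield each simple path with at least one edge lying inside `nodes`, once."""
    allowed = set(nodes)

    def extend(path):
        for w in adj_list[path[-1]]:
            if w in allowed and w not in path:
                new = path + (w,)
                if new[0] < new[-1]:       # keep one orientation of each path
                    yield new
                yield from extend(new)

    for start in allowed:
        yield from extend((start,))


def coalition_value(adj_list, nodes, delta):
    """v(S) = sum over k of a_k(G, S) * delta^k."""
    return sum(delta ** (len(p) - 1) for p in simple_paths(adj_list, nodes))


def myerson_closed_form(adj_list, delta):
    """Y_i = sum over k of a_k^{(i)}(G) * delta^k / (k + 1)."""
    value = {i: 0 for i in adj_list}
    for p in simple_paths(adj_list, adj_list):
        k = len(p) - 1
        for i in p:
            value[i] += delta ** k / (k + 1)
    return value


def shapley(adj_list, delta):
    """Brute-force Shapley formula; the cost is exponential in the number of nodes."""
    players = list(adj_list)
    n = len(players)
    value = {}
    for i in players:
        others = [j for j in players if j != i]
        total = 0
        for s in range(n):
            weight = Fraction(factorial(s) * factorial(n - s - 1), factorial(n))
            for subset in combinations(others, s):
                gain = (coalition_value(adj_list, subset + (i,), delta)
                        - coalition_value(adj_list, subset, delta))
                total += weight * gain
        value[i] = total
    return value
```

Passing `delta` as a `Fraction` keeps the arithmetic exact, so the two allocations can be compared with `==`. For a triangle, every node receives $\delta + \delta^2$, and the three payoffs sum to the value $3\delta + 3\delta^2$ of the grand coalition.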


3.2 Axiomatic Comparison of Centrality Indices 


As we have seen, there are many variants of centrality indices. Even inside the classes,
such as distance-based indices or betweenness indices, there is a good number of
variations. A big question, which remains largely open, is how to compare centrality
indices.

Of course, one can compare the indices numerically on some benchmark exam- 
ples, which is a practically valid approach and we have seen examples of such com- 
parisons in the first part of this chapter. One promising analytical approach is to 
propose a set of natural properties or axioms and test available centrality indices 
against those axioms. Such an approach was initially proposed by Boldi and Vigna, 
2014. Let us describe it here. 

The first two axioms of Boldi and Vigna, 2014 test centrality indices with respect 
to change of size and change of density. Two examples of strongly connected graphs 
with extreme densities are a cycle composed of links with the same direction and a
clique with bi-directional links. 


Size axiom. Consider the graph $G_{k,p}$ composed of a $k$-clique and a directed $p$-cycle.
A centrality index satisfies the size axiom if for every $k$ there is $p_k$ such that for all
$p > p_k$ the centrality of a node in the $p$-cycle is strictly larger than the centrality
of a node in the $k$-clique and, conversely, if for every $p$ there is $k_p$ such that for all
$k > k_p$ the centrality of a node in the $k$-clique is strictly larger than the centrality of a
node in the $p$-cycle.


Intuitively, the above axiom says that a node belonging to a very large but sparse 
community should be more important than a node belonging to a dense but small 
community. 

However, the next axiom states that if communities are equal in size, a node 
belonging to a denser community should be more important. 


Density axiom. Consider the graph $D_{k,p}$ composed of a $k$-clique and a directed $p$-cycle,
which are connected by a bi-directional bridge, $x \leftrightarrow y$, with node $x$ belonging
to the clique and node $y$ belonging to the cycle. A centrality index satisfies the density
axiom if for $k = p$ the centrality of $x$ is strictly larger than the centrality of $y$.


Then, the third axiom says that it is natural that an immediate direct link always 
improves the index value of the node pointed by that link. 


Score-monotonicity axiom. A centrality measure satisfies the score-monotonicity
axiom if for every graph $G$ and every pair of nodes $x$ and $y$ such that there is no link
from $x$ to $y$, when we add a link $x \to y$, the centrality of node $y$ increases.


In Table 3.1, taken from Boldi and Vigna, 2014, we summarise the verification
of the above axioms for the most common centrality indices.

It is interesting to observe that, for the given selection of centrality indices, only
harmonic centrality satisfies all three axioms. This appears to be quite surprising,
taking into account how basic and simple the requirements of the axioms are.
However, as will be demonstrated by application examples, we should not imme- 
diately discard the centrality indices that do not satisfy some axioms. They can be 
useful for tasks that are not described by those axioms. 
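The size axiom can be explored numerically for harmonic centrality. The sketch below assumes the usual definition of harmonic centrality as the sum of the reciprocals of the distances towards a node, with the convention $1/\infty = 0$; the graph $G_{k,p}$ is built as a disjoint union of a bi-directional $k$-clique and a directed $p$-cycle, and the function names are ours.

```python
from collections import deque


def clique_and_cycle(k, p):
    """Successor lists of G_{k,p}: k-clique on 0..k-1, directed p-cycle on k..k+p-1."""
    succ = {v: [w for w in range(k) if w != v] for v in range(k)}
    for j in range(p):
        succ[k + j] = [k + (j + 1) % p]
    return succ


def harmonic_centrality(succ, x):
    """Sum over y != x of 1 / d(y, x); unreachable nodes contribute zero."""
    pred = {v: [] for v in succ}
    for u, targets in succ.items():
        for w in targets:
            pred[w].append(u)
    dist = {x: 0}
    queue = deque([x])
    while queue:                       # BFS along reversed links gives d(y, x)
        u = queue.popleft()
        for w in pred[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(1 / d for d in dist.values() if d > 0)
```

A node of the $k$-clique has harmonic centrality $k - 1$, whereas a node of the directed $p$-cycle has harmonic centrality $1 + 1/2 + \cdots + 1/(p-1)$. Both quantities are unbounded, in $k$ and in $p$ respectively, which is why either side can be made to dominate, as the size axiom requires.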


3.3 Applications of Centrality Indices 


3.3.1 Social, Bibliographic and Information Networks


Table 3.1. Axiom verification table (Boldi and Vigna, 2014).

Centrality     Size      Density   Score-monotonicity
Degree         only k    yes       yes
Closeness      no        no        no
Harmonic       yes       yes       yes
Betweenness    only p    no        no
Seeley         no        yes       no
Katz           only k    yes       yes
PageRank       no        yes       yes
HITS           only k    yes       no

Most definitions of centrality indices originated in the domains of sociology
and information networks. This is quite natural, as centrality indices should indicate
which members of a social network are more important or powerful. Let us mention
a few key contributions for centrality indices in sociology (the list is certainly not
exhaustive): Bavelas, 1950; Bonacich, 1987; Bonacich and Lloyd, 2001; Borgatti,
2005; Brandes, 2008; Everett and Borgatti, 1999; Freeman, 1977; Freeman et al.,
1991; Friedkin, 1991; Hubbell, 1965; Katz, 1953; Newman, 2005a.

In bibliometrics, the citation count is simply the indegree centrality index for the 
citation network. (Citation networks have been introduced in the classical works by 
Solla Price, 1965, 1976.) Clearly, the citation count has its limitations. For instance, 
consider the case of an excellent original research article followed up by a com- 
prehensive survey article. Over the years the survey article can accumulate more 
citations than the original research article, which could even be forgotten. 

Chen et al., 2007 have proposed to use PageRank to discover “scientific gems”.
They have ranked the publications in the Physical Review family of journals from
1893 to 2003 both by citation count and PageRank. Even though there appears to
be a strong positive correlation between these two indices, there are articles with
a very modest number of citations but with a very high PageRank score. These
are often “scientific gems”. For instance, a very important scientific technique or
concept can be introduced in an article and later be named in
honour of its inventors; it is then used in many other articles without any specific
reference being given.

In recent work by Mariani et al., 2016, the authors argue that PageRank identifies
the established “scientific gems” well but may miss new milestones. They proposed
a rescaled PageRank, which takes into account the publication time.

The citation network is just one example of information networks. Other 
notable examples of information networks are the world wide web (see e.g., Brin 
and Page, 1998; Kleinberg, 1999; Hopcroft and Sheldon, 2008), the authorship 
network (the authors are nodes and the citations among the authors are the links), 
the co-authorship network (the articles are nodes and the links indicate if two papers
were written by the same author²), journal citation network, etc. For instance,
the authors of (Pinski and Narin, 1976; Bollen et al., 2006; Bergstrom, 2007;
Bergstrom et al., 2008; Gonzalez-Pereira et al., 2010) use centrality indices, mostly
spectral, to rank journals and the authors of (Fiala et al., 2008; Ding et al., 2009;
Yan and Ding, 2009; Fiala, 2012; West et al., 2013) use centrality indices to rank
authors.


3.3.2 Semi-supervised Learning 


Labelling data is a laborious and expensive process. Therefore, in many datasets the 
amount of labelled data is small and standard supervised machine learning methods 
either lead to a significant amount of errors or are not applicable at all. Fortunately, a 
graph-based semi-supervised learning method can help in such situations (Chapelle 
et al., 2006). 

The main idea behind graph-based semi-supervised learning is first to construct
a graph on the data points, where a link between two data points indicates a strong
relationship between these points. Then, one can use a similarity measure on
the graph to assign unlabelled data points to the classes defined by the labelled data
points.

One example of a similarity measure is given by the Personalized PageRank, see
(Avrachenkov et al., 2012). Suppose that class $k$ is defined by a set of labelled points
$L_k$, and let $v_k$ be some (e.g., uniform) distribution with support on the set
$L_k$. Then, we can define the similarity of data point $u$ to class $k$ by

$$y_u^{(k)} = (1 - c)\, v_k^T [I - cP]^{-1} e_u,$$

with $P = D^{-1}A$. Thus, we attribute point $u$ to class $\hat{k}$ if

$$\hat{k} = \arg\max_{k} y_u^{(k)}.$$
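The classification rule based on the Personalized PageRank can be sketched in a few lines. The code below is an illustration under our own assumptions: the graph is given by an adjacency matrix without isolated nodes, the matrix inverse is replaced by the truncated series $(1-c)\sum_{t} c^t v_k^T P^t$, and the function names, the default value $c = 0.85$ and the number of iterations are hypothetical choices.

```python
def personalized_pagerank(adj, seeds, c=0.85, iterations=200):
    """Approximate (1 - c) v^T [I - cP]^{-1}, with v uniform on the set `seeds`."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    v = [1 / len(seeds) if u in seeds else 0.0 for u in range(n)]
    term = v[:]                         # holds c^t v^T P^t
    scores = [0.0] * n
    for _ in range(iterations):
        scores = [s + (1 - c) * x for s, x in zip(scores, term)]
        term = [c * sum(term[i] * adj[i][u] / deg[i] for i in range(n))
                for u in range(n)]
    return scores


def classify(adj, labelled, c=0.85):
    """`labelled` maps each class to its set of labelled nodes; returns one class per node."""
    scores = {k: personalized_pagerank(adj, seeds, c) for k, seeds in labelled.items()}
    return [max(scores, key=lambda k: scores[k][u]) for u in range(len(adj))]
```

For two triangles joined by a single link, with one labelled node in each triangle, every node is attributed to the class of the labelled node of its own triangle. The truncation error of the series is of order $c^{200}$, which is negligible for the default parameters.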


An interested reader can find more examples of measures of node similarity in
(Avrachenkov et al., 2019). Many node similarity measures are related to centrality
indices. We shall discuss semi-supervised learning in much more detail in
Chapter 5.


2. Note that one can also consider a co-authorship network, where the authors are nodes and a link is present if
two authors have co-authored an article. The Erdős number network is one famous example of a co-authorship
network.

3.3.3 Community Detection


The community detection problem is the problem of finding tight-knit groups of
nodes in a network. We dedicate the whole of Chapter 4 to this important topic,
but for now let us just point out some applications of centrality indices to the
community detection problem.

Betweenness centrality indices are very effective in solving the community detection
problem. Specifically, in (Newman and Girvan, 2004) and (Newman, 2005a)
the edges with the largest values of edge betweenness centrality are deleted, the
betweenness centrality is then recomputed, the edges with the largest
values of betweenness centrality are deleted again, and so on. This process eventually
leads to components that are disconnected from each other.
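The edge removal procedure can be sketched as follows. This is a simplified illustration and not the exact procedure of the cited works: it assumes an unweighted undirected graph with at least one edge, stored as a dictionary of neighbour lists with comparable labels, it deletes a single edge of maximal shortest path betweenness at each step, and it stops as soon as the number of connected components increases. The function names are ours.

```python
from collections import deque


def components(adj_list):
    """Connected components of an undirected graph, as a list of sets."""
    seen, comps = set(), []
    for s in adj_list:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj_list[u]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj_list):
    """Unnormalised shortest path edge betweenness (sum over ordered pairs)."""
    score = {}
    for s in adj_list:
        dist, sigma, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj_list[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
        dependency = {v: 0.0 for v in order}
        for w in reversed(order):            # accumulate from the farthest nodes
            for u in adj_list[w]:
                if dist[u] == dist[w] - 1:
                    share = sigma[u] / sigma[w] * (1 + dependency[w])
                    edge = (min(u, w), max(u, w))
                    score[edge] = score.get(edge, 0.0) + share
                    dependency[u] += share
    return score


def split_by_edge_betweenness(adj_list):
    """Delete highest-betweenness edges until one more component appears."""
    graph = {u: set(nbrs) for u, nbrs in adj_list.items()}
    start = len(components(graph))
    while len(components(graph)) == start:
        score = edge_betweenness(graph)      # recomputed after every deletion
        u, w = max(score, key=score.get)
        graph[u].discard(w)
        graph[w].discard(u)
    return components(graph)
```

Applying the function recursively to the resulting components would produce a hierarchy of nested communities.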

The authors of (Avrachenkov et al., 2008a) proposed first to find central nodes
that represent communities well with the help of PageRank and then, as in
semi-supervised learning, to use Personalized PageRank to attribute nodes to
communities.

Personalized PageRank and other centrality indices have also been applied to 
local graph clustering: Orponen and Schaeffer, 2005; Andersen et al., 2006; Zhu 
et al., 2013; Orecchia and Zhu, 2014; Gleich and Mahoney, 2014. This is clearly 
related to graph-based semi-supervised learning. 


3.3.4 Further Applications 


Historically, the first application of centrality indices was in sport, and in particular
in chess (Landau, 1895), which was followed by many other applications in this
domain; to name just a few: Wei, 1952; Kendall, 1955; Keener, 1993; Callaghan
et al., 2007; Langville and Meyer, 2012.

Centrality indices play an important role in the analysis of network robustness
(Albert et al., 2000; Holme et al., 2002; Ellens et al., 2011; Rueda et al., 2017;
Ofori-Boateng et al., 2021). In particular, Clemente and Cornaro, 2020 proposed
new centrality indices characterizing nodes and edges with respect to their effect
on the vulnerability of a network.

Many recommender systems use centrality indices, in particular, centrality 
indices based on random walks: Fouss et al., 2007; Gori et al., 2007; Boldi et al.,
2008; Mei et al., 2008; Fouss et al., 2012; Davoodi et al., 2013. 

Centrality indices are also used in various NLP tasks such as semantic similarity 
(Sinha and Mihalcea, 2007), word sense disambiguation (Agirre and Soroa, 2009) 
and person name disambiguation (Smirnova et al., 2010). 
Further Notes


In addition to the work by Boldi and Vigna, 2014, there are a number of other
works characterizing centrality indices by axiomatic approaches. Already Sabidussi,
1966 proposed several natural axioms to test centrality indices. In Altman and
Tennenholtz, 2005 and Was and Skibski, 2018, the authors proposed axiomatizations
of Seeley’s and PageRank centralities. Then, in Skibski and Sosnowska, 2018 the
distance-based centralities were axiomatised.

In many applications, one needs to measure the centrality of a group of nodes
rather than that of a single node. For instance, it may be required to evaluate the
influence of a particular social group on society or to evaluate the importance
of a department within an organization. Several works, starting from Everett and
Borgatti, 1999, proposed variants of group centralities: Kolaczyk et al.,
2009; Veremyev et al., 2017; Akgün and Tural, 2020. Let us emphasize that in
the majority of cases it is not suitable to simply sum the centrality values of individual
nodes. It should come as no surprise that the methods of cooperative game
theory are very natural for defining group centrality indices, see e.g., Michalak et al.,
2013; Szczepański et al., 2016.

It is often important to find only the top-k central nodes of a network. This problem
was investigated in Avrachenkov et al., 2011, 2014c; Ostuni et al., 2013;
Avrachenkov et al., 2014b; Yoshida, 2014; Borassi and Natale, 2019; Fan et al.,
2019; see also references therein.

We note that if in Katz centrality the geometric discounting is changed to the 
Poisson, factorial discounting, this gives Estrada’s subgraph or communicability 
centrality (Estrada and Rodriguez-Velazquez, 2005; Estrada and Hatano, 2008). 
Furthermore, if now the adjacency matrix is replaced with the transition matrix of 
the random walk, this results in the heat kernel PageRank (Chung, 2007). 

In the survey (Gleich, 2015) one can find an excellent extensive overview of 
various modifications and applications of PageRank. 
Chapter 4


Community Detection in Networks 


We saw in the introduction that the node set of many networks can be partitioned 
into several groups based on the node attributes or on the node’s behaviour. For 
example, the members of the karate-club network split into two groups (Zachary, 
1977), while the blogs of the political blog network are labelled as being either liberal 
or conservative. Community detection (also referred to as community recovery or graph 
clustering) consists in inferring the latent community structure based on the node’s 
interactions. 

Community detection is a delicate problem, as the notion of community is,
strictly speaking, ill-defined. Indeed, although community structures are quite common
in real networks, it is hard to properly define what a community is. Nonetheless,
we can provide the following hints.


• Definitions based on node similarity. We can define as communities some
groups of nodes that behave similarly to each other. For example, we could
separate the nodes of a social network between influencers (people posting a
lot of content and being followed by numerous users) and followers (peripheral
nodes interacting mostly with influencers). To assess the similarity of two
nodes, we can use a node similarity measure (such as Personalized PageRank
or hitting time based centrality indices, see Chapter 3).

• Local definitions. We can intuitively define a community as a set of nodes
interacting a lot with each other. In that case, communities are groups of
nodes that are densely connected within the groups but sparsely connected
to the rest of the network.

• Global definitions. We can also evaluate the quality of a graph partition
into disjoint communities using a quantity called modularity. This quantity
compares the number of edges inside the community to the expected number
of internal edges in a null model.

Figure 4.1. Real networks with community structure: (a) Dolphin network; (b) Political blogs; (c) LiveJournal.


In addition to the above problem of choosing an appropriate definition of com- 
munities, we will see that community detection is often computationally difficult, 
and hence one needs to rely on approximation algorithms. 

Real networks for which ground-truth communities are known are often used 
to compare and evaluate various community detection algorithms. We give below 
a non-exhaustive list of such networks. We plot some of them in Figure 4.1 and 
summarise in Table 4.1 some statistics of those networks. We also refer to the intro- 
duction, where these and some other networks were described in more detail. 


• A selection of standard networks is available on Mark Newman’s personal
webpage: http://www-personal.umich.edu/~mejn/netdata/, including
the popular Zachary Karate Club (Zachary, 1977), an interaction network
between dolphins (Lusseau et al., 2003) and the political blog data set (Adamic
and Glance, 2005).

• The LINQS webpage https://linqs.soe.ucsc.edu/data hosts several data sets,
including Cora, Citeseer, Pubmed, and WebKB (Lu and Getoor, 2003).

• The netset webpage https://netset.telecom-paris.fr/ hosts several data sets,
including graphs with links between Wikipedia articles.

• Finally, the Stanford Large Network Dataset Collection
(https://snap.stanford.edu/data/) hosts a wide sample of larger networks.


Using synthetic networks is also common to assess the validity of commu- 
nity detection algorithms. A widely used random graph model with community 
structure is the Stochastic Block Model (SBM) and its degree-corrected variant (see 
Section 2.3). 
Table 4.1. Selection of real data sets for community detection with ground-truth.

Category            Data Set           n        |E|       K     Features
Social networks     karate club        34       78        2     0
                    dolphins           62       159       2     0
                    LiveJournal top2   2766     24138     2
Citation networks   cora               2485     5069      7     1433
                    citeseer           2110     3668      6     3703
                    DBLP top2          13326    34281     2
Web networks        political blogs    1222     1671778   2     0
                    wikischools        4403     100382    17    0
                    wikivitals         10008    629521    11    0
Images              MNIST              70,000   -         10    784
                    fashionMNIST       70,000   -         10    784
                    CIFAR-10           60,000   -         10
                    CIFAR-100          60,000   -         100


This chapter is structured as follows. We firstly present in Section 4.1 some cut- 
based methods, and their relaxation in the form of spectral clustering. Section 4.2 
introduces modularity-based methods, and in particular the very efficient Louvain 
algorithm for modularity maximisation. The Bayesian framework for community 
detection is presented in Section 4.3. Moreover, in each section, we validate the pro- 
posed methods by performing numerical experiments and we discuss each method’s 
limitations. Finally, we end the chapter with a theoretical analysis of the community 
detection problem in Section 4.4. 


4.1 Cut-based Methods 


In this section, we investigate the problem of separating the graph into K > 2 
groups, such that inside a group the edge density is higher than between two 
different groups. 


4.1.1 Graph Bisection 


We consider a graph whose nodes are $\{1, \dots, n\}$, and whose adjacency matrix is
$A = (a_{ij})_{i,j=1}^{n}$. The graph is undirected but possibly weighted, so that $a_{ij} = a_{ji} \ge 0$.
The degree $d_i$ of a node $i \in V$ is defined as $\sum_{j=1}^{n} a_{ij}$.

Our goal is to partition the node set $V$ into two subsets $V_1, V_2$ such that
$V_1 \cap V_2 = \emptyset$ (non-overlapping communities) and $V_1 \cup V_2 = \{1, \dots, n\}$. Note that
$V_2 = V_1^c$, where $V_1^c$ denotes $\{1, \dots, n\} \setminus V_1$, the complementary set of $V_1$.


Definition 4.1. Given a set of nodes $V_1$ and an undirected graph represented by
its adjacency matrix $A$, we denote by $\mathrm{Cut}(A, V_1)$ the total weight of the edges going
from $V_1$ to its complement $V_1^c$. That is,

$$\mathrm{Cut}(A, V_1) = \sum_{i \in V_1} \sum_{j \in V_1^c} a_{ij}.$$

At first glance, we might be tempted to solve

$$\hat{V}_1 = \arg\min_{V_1 \subset [n]} \mathrm{Cut}(A, V_1). \qquad (4.1)$$


But the trivial solutions of this minimisation problem are $V_1 = V$ and $V_1 = \emptyset$,
which correspond to assigning every node to one cluster and leaving the second
cluster empty! Moreover, even if we impose $V_1 \neq V$ and $V_1 \neq \emptyset$, we are likely to obtain
a solution where almost all the nodes are in one cluster and only a few nodes are in the
other cluster.

Consequently, one can impose that the predicted sets $V_1$ and $V_1^c$ should be
roughly of the same size. To do so, one can penalize the imbalanced solutions.
Firstly, let us impose the sets $V_1$ and $V_1^c$ to be exactly of the same size by restricting
the minimisation problem to sets $V_1$ such that $|V_1| = n/2$. The new minimisation
problem

$$\arg\min_{V_1 \subset [n]:\ |V_1| = n/2} \mathrm{Cut}(A, V_1) \qquad (4.2)$$


is called the graph bisection problem. Of course, in practice the two clusters are often
of different sizes. This case is treated in the next section, along with a generalization
to $K$ clusters ($K > 2$).

But, even in this simple two cluster setting, another problem arises: the minimi- 
sation problem (4.2) is NP-hard (Wagner and Wagner, 1993; Garey et al., 1974). 
Therefore, we have to rely on approximate methods. In the following, we propose a 
relaxation method based on the Laplacian. A similar method based on the adjacency 
matrix and a different method based on Semi-Definite Programming are presented 
in Section 4.1.3. 


First relaxation method: Laplacian spectral clustering 


Proposition 4.1. For $V_1 \subset [n]$ such that $|V_1| = n/2$, define $z \in \{-1, 1\}^n$ as the vector
associated with the partition $(V_1, V_1^c)$, that is, $z_i = 1$ if $i \in V_1$ and $z_i = -1$ otherwise.
We have

$$\arg\min_{V_1 \subset [n]:\ |V_1| = n/2} \mathrm{Cut}(A, V_1) = \arg\min_{V_1 \subset [n]:\ |V_1| = n/2} z^T L z.$$

Moreover, $z \perp \mathbf{1}_n$ and $\|z\|_2^2 = n$.


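Proof. The facts that $z \perp \mathbf{1}_n$ and $\|z\|_2^2 = n$ are immediate from the constraint
$|V_1| = n/2$. Moreover, we notice that

$$\frac{(z_i - z_j)^2}{4} = \begin{cases} 1 & \text{if } i \in V_1, j \in V_1^c \text{ or } i \in V_1^c, j \in V_1, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore,

$$\mathrm{Cut}(A, V_1) = \sum_{i \in V_1} \sum_{j \in V_1^c} a_{ij} = \frac{1}{2} \sum_{i,j=1}^{n} a_{ij} \frac{(z_i - z_j)^2}{4} = \frac{1}{4} z^T L z,$$

where the latter equality holds by Proposition A.10 from the background
Section A.2. This ends the proof, as the factor $\frac{1}{4} > 0$ does not impact the minimisation
problem.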
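The identity used in the proof, as well as the combinatorial nature of the bisection problem, can be illustrated on a toy example. The sketch below assumes an undirected graph given by a symmetric adjacency matrix with an even number of nodes; the function names are ours. The exhaustive search enumerates all subsets of size $n/2$ and is therefore exponential in $n$.

```python
from itertools import combinations


def cut(adj, v1):
    """Total weight of the edges between the node set v1 and its complement."""
    inside = set(v1)
    n = len(adj)
    return sum(adj[i][j] for i in inside for j in range(n) if j not in inside)


def laplacian_form(adj, z):
    """Quadratic form z^T L z with L = D - A."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return sum(deg[i] * z[i] ** 2 for i in range(n)) - sum(
        adj[i][j] * z[i] * z[j] for i in range(n) for j in range(n))


def best_bisection(adj):
    """Exhaustive solution of the graph bisection problem (first minimiser found)."""
    n = len(adj)
    return min(combinations(range(n), n // 2), key=lambda v1: cut(adj, v1))
```

For any subset $V_1$ and the associated vector $z$, the value `4 * cut(adj, v1)` coincides with `laplacian_form(adj, z)`, in agreement with the computation in the proof.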


Minimisation problem (4.2) is therefore equivalent to

$$\hat{z} = \arg\min_{\substack{z \in \{-1,1\}^n \\ \|z\|_2^2 = n \\ z \perp \mathbf{1}_n}} z^T L z, \qquad (4.3)$$

where the two associated clusters are simply $\hat{V}_1 = \{i \in [n] : \hat{z}_i = 1\}$ and
$\hat{V}_2 = \{i \in [n] : \hat{z}_i = -1\}$. A possible continuous relaxation of (4.3) is

$$\hat{x} = \arg\min_{\substack{x \in \mathbb{R}^n \\ \|x\|_2^2 = n \\ x \perp \mathbf{1}_n}} x^T L x.$$

By relaxation, we mean that we went from $z \in \{-1, 1\}^n$ to a real-valued vector
$x \in \mathbb{R}^n$. This allows us to use standard calculus methods to solve the arg min problem
(cf. Lemma 4.1). Once $\hat{x}$ is computed, we can cluster according to the sign of $\hat{x}_i$.
This leads to the standard spectral clustering method (Algorithm 4). Nonetheless,
there is in general no guarantee that the solution of the relaxation of (4.3)
coincides with the true solution of the original problem (4.2). We refer to Section 4.4
for a more careful discussion.
Lemma 4.1. Let $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $L$, and $v_1, \dots, v_n$
the corresponding basis of orthogonal eigenvectors, normalized so that $\|v_i\|_2^2 = n$. We
have

$$\arg\min_{\substack{x \in \mathbb{R}^n \\ \|x\|_2^2 = n \\ x \perp \mathbf{1}_n}} x^T L x = v_2.$$

Proof. Indeed, $v_1 = \mathbf{1}_n$, and we conclude the proof by the Courant-Fischer Theorem
(see Theorem A.13 in Appendix A.3.3).


Algorithm 4: Standard Spectral Clustering, 2 clusters.
Input: graph standard Laplacian $L$.
Output: clustering assignment $\hat z \in \{1, 2\}^n$.
Spectral Step:
• let $v_2$ be the eigenvector of $L$ associated with the second smallest eigenvalue;
• for $i = 1, \dots, n$, let $\hat z_i = 1$ if $(v_2)_i > 0$, and $\hat z_i = 2$ otherwise.

Return: $\hat z$.
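As an illustration, Algorithm 4 can be sketched in a few lines of Python. This is only a minimal sketch, assuming NumPy is available and that the graph is given as a dense symmetric adjacency matrix; the function name `spectral_bipartition` and the toy graph are ours, not part of the original text.

```python
import numpy as np

def spectral_bipartition(A):
    """Two-cluster spectral clustering from a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A            # standard Laplacian L = D - A
    _, eigenvectors = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    v2 = eigenvectors[:, 1]                   # eigenvector of the second smallest eigenvalue
    return np.where(v2 > 0, 1, 2)             # labels in {1, 2}, as in Algorithm 4

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
labels = spectral_bipartition(A)
```

Since an eigenvector is defined only up to its sign, the two label values may be swapped from one run or platform to another; only the induced partition is meaningful.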


4.1.2 General Case: More Than Two Clusters


In this section, we extend the method of the preceding section to the general situation of $K > 2$ clusters of possibly different sizes.

Let $V_1, \dots, V_K$ be a partition of $V$ into $K$ non-overlapping clusters, that is $V_1 \cup \dots \cup V_K = V$ and $V_k \cap V_\ell = \emptyset$ for $k \neq \ell$. We highlighted in the preceding section the importance of penalizing partitions with unbalanced cluster sizes in the Cut-minimisation problem. To measure the size of a cluster $V_k$, we define the two following metrics:

$$|V_k| = \sum_{i=1}^{n} 1(i \in V_k) \quad \text{and} \quad \mathrm{vol}(V_k) = \sum_{i \in V_k} d_i.$$

The quantity $|V_k|$ corresponds to the number of nodes belonging to the set $V_k$, while $\mathrm{vol}(V_k)$ is the volume of the set $V_k$, that is the sum of the degrees of the nodes belonging to $V_k$. Instead of directly minimising the Cut, we will minimise one of these two quantities:


$$\mathrm{RatioCut}(A, V_1, \dots, V_K) = \sum_{k=1}^{K} \frac{\mathrm{Cut}(A, V_k)}{|V_k|}, \tag{4.4}$$

$$\mathrm{NCut}(A, V_1, \dots, V_K) = \sum_{k=1}^{K} \frac{\mathrm{Cut}(A, V_k)}{\mathrm{vol}(V_k)}. \tag{4.5}$$

The Ratio-Cut (resp. Normalized-Cut or NCut) corresponds to a Cut penalized with respect to the size (resp. volume) of the sets $V_k$: small sets bear a large penalty. Hence, we can expect that the solutions minimising the Ratio-Cut or the Normalized-Cut lead to clusters of balanced sizes.
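To make the effect of the two penalties concrete, the following sketch evaluates (4.4) and (4.5) on a small graph. It assumes, as in the proof of Lemma 4.2, that $\mathrm{Cut}(A, V_k)$ is the total weight of the edges leaving $V_k$; the helper names and the toy graph (two triangles joined by one edge) are ours.

```python
def cut(adj, cluster):
    """Cut(A, V_k): total weight of the edges leaving the node set `cluster`."""
    inside = set(cluster)
    return sum(adj[i][j] for i in inside for j in range(len(adj)) if j not in inside)

def ratio_cut(adj, clusters):
    """RatioCut (4.4): each cut is divided by the number of nodes of the cluster."""
    return sum(cut(adj, c) / len(c) for c in clusters)

def normalized_cut(adj, clusters):
    """NCut (4.5): each cut is divided by the volume (sum of degrees) of the cluster."""
    degree = [sum(row) for row in adj]
    return sum(cut(adj, c) / sum(degree[i] for i in c) for c in clusters)

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

balanced = [[0, 1, 2], [3, 4, 5]]
unbalanced = [[0], [1, 2, 3, 4, 5]]
```

On this graph the balanced partition gives RatioCut 2/3 and NCut 2/7, while isolating node 0 gives RatioCut 2.4 and NCut 7/6, which shows how small sets are penalized.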

As before, the minimisation of those quantities over all possible partitions $(V_1, \dots, V_K)$ is NP-hard, and we will instead solve a relaxed version of the problem. Let us define the matrix $H = (h_{ik}) \in \mathbb{R}^{n \times K}$ by:

$$\forall i \in [n],\ \forall k \in [K]: \quad h_{ik} = \begin{cases} \dfrac{1}{\sqrt{|V_k|}}, & \text{if } v_i \in V_k, \\ 0, & \text{otherwise.} \end{cases} \tag{4.6}$$

$H$ is a matrix containing the $K$ indicator vectors as columns, where the size of each set $V_k$ is used as a normalisation term. Similarly, let us define $N = (n_{ik}) \in \mathbb{R}^{n \times K}$ as:

$$\forall i \in [n],\ \forall k \in [K]: \quad n_{ik} = \begin{cases} \dfrac{1}{\sqrt{\mathrm{vol}(V_k)}}, & \text{if } v_i \in V_k, \\ 0, & \text{otherwise.} \end{cases} \tag{4.7}$$

Here we used the volume of each set $V_k$ as a normalisation term. We have the following lemma.

Lemma 4.2. The following holds:

(i) $\mathrm{RatioCut}(A, V_1, \dots, V_K) = \mathrm{Tr}(H^T L H)$;
(ii) $\mathrm{NCut}(A, V_1, \dots, V_K) = \mathrm{Tr}(N^T L N)$;
(iii) $H^T H = I_K$ and $N^T D N = I_K$.


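Proof. This lemma follows from the observations that

$$(H^T L H)_{kk} = H_{\cdot k}^T L H_{\cdot k} = \frac{\mathrm{Cut}(A, V_k)}{|V_k|},$$

where $H_{\cdot k}$ denotes the column $k$ of $H$, and

$$(N^T L N)_{kk} = N_{\cdot k}^T L N_{\cdot k} = \frac{\mathrm{Cut}(A, V_k)}{\mathrm{vol}(V_k)}.$$

Indeed,

$$H_{\cdot k}^T L H_{\cdot k} = \frac{1}{2} \sum_{i,j} a_{ij} (h_{ik} - h_{jk})^2 = \frac{1}{2} \left( \sum_{i \in V_k,\ j \notin V_k} \frac{a_{ij}}{|V_k|} + \sum_{i \notin V_k,\ j \in V_k} \frac{a_{ij}}{|V_k|} \right) = \frac{1}{2} \cdot 2\, \mathrm{Cut}(A, V_k) \cdot \frac{1}{|V_k|}.$$

The second equality holds since $h_{ik} = h_{jk}$ if $(i \in V_k,\ j \in V_k)$ or $(i \notin V_k,\ j \notin V_k)$. The computations for $(N^T L N)_{kk}$ are similar.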


Therefore, minimising the RatioCut can be rewritten as:

$$\operatorname*{arg\,min}_{V_1, \dots, V_K} \mathrm{Tr}\left(H^T L H\right), \tag{4.8}$$

where $L = D - A$, and $H$ is defined in equation (4.6). Similarly, minimising NCut can be rewritten as:

$$\operatorname*{arg\,min}_{V_1, \dots, V_K} \mathrm{Tr}\left(U^T \mathcal{L} U\right), \tag{4.9}$$

where $\mathcal{L} = D^{-1/2} L D^{-1/2}$, $U := D^{1/2} N$, and $N$ is defined in equation (4.7). The next step is to relax minimisation problems (4.8) and (4.9), by keeping only the constraints $H^T H = I_K$ and $U^T U = I_K$. The solution of these relaxed problems is given by the next proposition (we refer to Proposition A.15 in Appendix A.3.3 for the proof).

Proposition 4.2. Let $M \in \mathbb{R}^{n \times n}$ be a symmetric matrix. A solution of $\arg\min \mathrm{Tr}(X^T M X)$, where $X \in \mathbb{R}^{n \times K}$ is subject to $X^T X = I_K$, is given by the matrix $V \in \mathbb{R}^{n \times K}$ whose columns are the first $K$ orthonormal eigenvectors of $M$ (that is, those associated with the $K$ smallest eigenvalues).


Once the relaxed problem is solved, we are left with an $n$-by-$K$ matrix $V$ whose columns correspond to the $K$ first eigenvectors of $L$ (or $\mathcal{L}$). To convert this real-valued matrix back to a discrete partition, a standard way is to consider the $n$ rows of $V$ (hence giving $n$ data points in $\mathbb{R}^K$), and to apply the k-means algorithm to these $n$ data points. More precisely, k-means consists in the following minimisation problem

$$(\hat Z, \hat X) = \operatorname*{arg\,min}_{\substack{Z \in \mathcal{Z}_{n,K} \\ X \in \mathbb{R}^{K \times K}}} \|ZX - V\|_F^2, \tag{4.10}$$

where $\mathcal{Z}_{n,K}$ denotes the space of membership matrices, that is $n \times K$ matrices with entries in $\{0, 1\}$ for which each row $i$ has only one non-zero element. While solving the minimisation problem (4.10) is NP-hard, there exists (see Kumar et al., 2004) a polynomial time procedure finding

$$(\hat Z, \hat X) \in \mathcal{Z}_{n,K} \times \mathbb{R}^{K \times K} \quad \text{s.t.} \quad \|\hat Z \hat X - V\|_F^2 \le (1 + \epsilon) \min_{\substack{Z \in \mathcal{Z}_{n,K} \\ X \in \mathbb{R}^{K \times K}}} \|ZX - V\|_F^2. \tag{4.11}$$

Once $\hat Z$ is found, we return the predicted clusters: node $i$ is in cluster $k$ if $\hat Z_{ik} = 1$. We summarize this in Algorithm 5.


Algorithm 5: (Normalized) spectral clustering.
Input: graph Laplacian $L$ (resp. normalized Laplacian $\mathcal{L}$), number of clusters $K$.
Output: predicted node labelling vector $\hat z \in [K]^n$.
Spectral Step:
• compute $v_1, \dots, v_K$, the $K$ orthonormal eigenvectors of $L$ (resp. of $\mathcal{L}$) associated with the $K$ smallest eigenvalues;
• let $V \in \mathbb{R}^{n \times K}$ be the matrix whose column $k$ is $v_k$.

Clustering Step:
• let $(\hat Z, \hat X)$ be a $(1 + \epsilon)$-approximate solution to the k-means problem (4.11);
• for every node $i = 1, \dots, n$, let $\hat z_i = k$ if $\hat Z_{ik} = 1$.

Return: $\hat z$.
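The two steps of Algorithm 5 can be sketched as follows. This is a minimal sketch under several assumptions of ours: NumPy is available, the graph is small enough for a dense eigendecomposition, all degrees are positive, and the clustering step uses plain Lloyd iterations with a deterministic farthest-point initialisation, which is a common heuristic and not the $(1 + \epsilon)$-approximation procedure of Kumar et al. The function names and the toy graph are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                       # nearest center for every row
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def normalized_spectral_clustering(A, k):
    """Spectral step on the normalized Laplacian, then k-means on the rows of V."""
    A = np.asarray(A, dtype=float)
    s = 1.0 / np.sqrt(A.sum(axis=1))                       # D^{-1/2}; degrees assumed positive
    L_norm = np.eye(len(A)) - s[:, None] * A * s[None, :]  # I - D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L_norm)                       # ascending eigenvalues
    V = vecs[:, :k]                                        # K smallest eigenvalues
    return kmeans(V, k)

# Toy graph (assumed for illustration): three 5-cliques chained by single edges.
n, size = 15, 5
A = np.zeros((n, n))
for start in range(0, n, size):
    for i in range(start, start + size):
        for j in range(i + 1, start + size):
            A[i, j] = A[j, i] = 1.0
for i, j in [(4, 5), (9, 10)]:
    A[i, j] = A[j, i] = 1.0
labels = normalized_spectral_clustering(A, 3)
```

In practice one would use a sparse eigensolver and a library k-means with several restarts; the dense version above only aims at making the two steps explicit.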


4.1.3 Semi-definite Programming


Similarly to the preceding section, we can also consider the problem of minimising

$$\mathrm{Cut}(A, V_1, \dots, V_K) = \sum_{k=1}^{K} \mathrm{Cut}(A, V_k) \tag{4.12}$$

over the partitions $(V_1, \dots, V_K)$ of $V$ such that all the clusters $V_k$ have equal size $|V|/K$. Similarly to what was done in the preceding section, we can show that minimising (4.12) is equivalent to maximising

$$\mathrm{Tr}\left(X^T A X\right) \tag{4.13}$$

such that $X = (x_{ik})$ is an $n \times K$ matrix with

$$x_{ik} = \begin{cases} 1 & \text{if } v_i \in V_k, \\ 0 & \text{otherwise.} \end{cases}$$
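The equivalence between minimising (4.12) and maximising (4.13) can be checked in one line. The following is a short derivation sketch, assuming (as in the proof of Lemma 4.2) that $\mathrm{Cut}(A, V_k) = \sum_{i \in V_k} \sum_{j \notin V_k} a_{ij}$:

$$\mathrm{Tr}\left(X^T A X\right) = \sum_{k=1}^{K} \sum_{i \in V_k} \sum_{j \in V_k} a_{ij} = \sum_{i,j=1}^{n} a_{ij} - \sum_{k=1}^{K} \sum_{i \in V_k} \sum_{j \notin V_k} a_{ij} = \sum_{i,j=1}^{n} a_{ij} - \mathrm{Cut}(A, V_1, \dots, V_K).$$

Since $\sum_{i,j} a_{ij}$ does not depend on the partition, the trace is largest exactly when the total cut is smallest.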


Maximising expression (4.13) leads to another Spectral Clustering method based on the adjacency matrix, where one looks for the $K$ eigenvectors associated with the $K$ largest eigenvalues of $A$. We can also propose a different relaxation method. Indeed, from the relationship

$$\mathrm{Tr}\left(X^T A X\right) = \mathrm{Tr}\left(A X X^T\right) \tag{4.14}$$

it turns out that minimising (4.12) is equivalent to solving the following optimisation problem

$$\operatorname*{arg\,max}_{\substack{Y \in \{0, 1\}^{n \times n} \\ Y \succeq 0 \\ \mathrm{rank}(Y) = K \\ Y_{ii} = 1 \\ Y 1_n = \frac{n}{K} 1_n}} \langle A, Y \rangle, \tag{4.15}$$

where $\langle A, Y \rangle = \mathrm{Tr}(A Y^T)$ denotes the usual matrix scalar product. The first four constraints in (4.15) force $Y$ to be of the form $X X^T$ while the last constraint forces the clusters to be of the same size.

A possible relaxation of optimisation problem (4.15) is the following semi-definite program

$$\operatorname*{arg\,max}_{\substack{Y \in \mathbb{R}^{n \times n} \\ Y \succeq 0 \\ Y_{ii} \le 1 \\ Y 1_n = \frac{n}{K} 1_n}} \langle A, Y \rangle. \tag{4.16}$$


4.1.4 Discussion


Complexity of spectral clustering 


Spectral methods require the computation of the eigenvectors, which has a worst-case complexity of $O(n^3)$. However, in practice when dealing with a sparse matrix whose eigenvalues are well separated, the complexity can be close to $O(Kn)$, where $K$ is the number of eigenvectors needed (see e.g., Demmel et al., 2008).

Table 4.2. Performance of spectral clustering on real data sets.

Data Set            n         |E|        K     Accuracy
karate club         34        78         2     94%
dolphins            62        159        2     98%
political blogs     1,222     16,717     2     52%
DBLP-top2           13,326    34,281     2     55%
LiveJournal-top2    2,766     24,138     2     99%
cora                2,485     5,069      7     37%
citeseer            2,110     3,668      6     59%
MNIST               70,000    784,186    10    63%


Performance of spectral clustering on real data sets 


We first show in Table 4.2 the performance of spectral clustering, as it is implemented in the scikit-learn Python library¹. This implementation uses the normalized Laplacian (in practice, it has been observed that the normalized Laplacian outperforms the standard Laplacian).

We also show in Figure 4.2 the performance of normalized spectral clustering 
on the MNIST data set when we select two digits. We observe that most digit pairs 
are well predicted, but digit pairs (4,9), (5, 8) and (7,9) are the hardest to distin- 
guish, showing an accuracy of 0.53, 0.70 and 0.72, respectively. This highlights the 
intuitive fact that in those pairs the digits look similar. 


Spectral methods and dangling trees 


Let us analyse the failure of spectral clustering on the political blogs data set. 
Figure 4.3 shows the values of eigenvector components of £ associated to the second 
and third smallest eigenvalues. We see that the entries of the second eigenvector are 
localised over a few nodes. Moreover, those nodes are associated to a dangling tree, 
and do not correspond to a meaningful community structure (see Figure 4.3c). On 
the contrary, the entries of the third eigenvector correspond to the correct com- 
munity structure. In fact, using this eigenvector for clustering would lead to an 
accuracy of 95%. 

Figure 4.3 shows that for this data set, the good eigenvector for clustering is the third one, while the second eigenvector is concentrated around low-degree nodes forming a dangling tree.² Since this behavior results in a partition of the graph into one large community with almost all the nodes and a small one with only a few nodes, it is easy to spot in practice. To solve this issue, one simple solution would be to look at higher-order eigenvectors. But how can one determine the correct eigenvector? This might not always be an easy task. Firstly, it could happen that the correct eigenvector is in a lower position, say 5th or 7th, and locating it among noisy eigenvectors might be non-trivial. Moreover, it is difficult to extend this reasoning to more than 2 clusters.

1. sklearn.cluster.SpectralClustering

Figure 4.2. Accuracy of normalized Spectral Clustering on the MNIST data set restricted to two digits.

Figure 4.3. Analysis of the failure of spectral clustering on the political blogs data set. Top: values of the eigenvector components of $\mathcal{L}$ associated to the $k$-th smallest eigenvalue, for $k = 2$ and $k = 3$. Bottom: graph where the node colors correspond to the prediction made using the sign of the entries of the $k$-th eigenvector.

Table 4.3. Accuracy of regularized spectral clustering on real data sets, for different values of $\tau$. Note that $\tau = 0$ corresponds to non-regularized spectral clustering.

Data Set           τ = 0    τ = 1    τ = d̄
political blogs    52%      95%      79% (d̄ = 27.4)
DBLP top2          55%      55%      55% (d̄ = 5.1)
cora               37%      51%      52% (d̄ = 4.1)
citeseer           59%      42%      32% (d̄ = 3.5)

The regularization technique aims at solving this issue. It consists in performing spectral clustering on $\mathcal{L}_\tau := I - D_\tau^{-1/2} A_\tau D_\tau^{-1/2}$, where $A_\tau := A + \frac{\tau}{n} 1_n 1_n^T$, and $D_\tau$ is the associated transformed degree matrix. The matrix $A_\tau$ is a perturbed version of the initial adjacency matrix $A$, where we added an edge of weight $\tau/n$ between all pairs of nodes. This tends to bring the dangling trees back to the rest of the graph, hence restoring order in the eigenvectors (Zhang and Rohe, 2018). Moreover, Le et al., 2017 showed that the regularized Laplacian $\mathcal{L}_\tau$ of Bernoulli random graphs is better concentrated around its expectation than the normalized Laplacian $\mathcal{L}$ (we develop this point further in Section 4.4.4, see in particular Theorem 4.9). The perturbation parameter $\tau$ is typically taken as $\tau = 1$ or $\tau = \bar d$, where $\bar d$ is the average degree of the graph. We compare in Table 4.3 the performance of standard spectral clustering with the regularized version.
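The regularized Laplacian is straightforward to build. The sketch below assumes NumPy and a dense adjacency matrix, and follows the convention $A_\tau = A + \frac{\tau}{n} 1_n 1_n^T$; the function name and the toy graph (a triangle with a dangling path) are ours.

```python
import numpy as np

def regularized_laplacian(A, tau):
    """Return L_tau = I - D_tau^{-1/2} A_tau D_tau^{-1/2} with A_tau = A + (tau/n) 1 1^T."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    A_tau = A + (tau / n) * np.ones((n, n))   # add a weak edge between every pair of nodes
    d_tau = A_tau.sum(axis=1)                 # transformed degrees (positive when tau > 0)
    s = 1.0 / np.sqrt(d_tau)
    return np.eye(n) - s[:, None] * A_tau * s[None, :]

# Toy graph (assumed for illustration): a triangle {0, 1, 2} with a dangling path 2-3-4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

tau = A.sum() / A.shape[0]                    # tau equal to the average degree
L_tau = regularized_laplacian(A, tau)
eigenvalues = np.linalg.eigvalsh(L_tau)
```

The matrix $\mathcal{L}_\tau$ can then be passed to the spectral step of Algorithm 5 in place of $\mathcal{L}$. A side benefit of taking $\tau > 0$ is that all transformed degrees are positive, so isolated nodes no longer cause a division by zero.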


Spectral methods and geometric data

In many situations, nodes can have geometric attributes (for example a position in a metric space). As shown in Avrachenkov et al., 2021a, this geometric structure handicaps cut-based clustering methods. Indeed, in this case, the Fiedler vector might be associated with a geometric configuration, hence bearing no information about the latent community labelling. To avoid this pitfall, Avrachenkov et al., 2021a proposed to look at higher-order eigenvectors in order to recover the correct community memberships. Figure 4.4 highlights this situation. While the second and fourth eigenvectors give configurations based on the node location, recovering the node labels is better done with the 10-th eigenvector. The exact rank of the ideal eigenvector depends on the model parameters, and we refer to Avrachenkov et al., 2021a for a detailed analysis.

2. Note that if one were to use the standard Laplacian, we would also observe an analogous phenomenon, with noisy eigenvectors concentrated around high-degree nodes.

Figure 4.4. Analysis of the failure of spectral clustering on a Geometric Block Model, with 100 nodes and inter- and intra-distance cutoffs $r_{in} = 0.07$, $r_{out} = 0.02$.

Figure 4.5. Accuracy obtained on a weighted graph built using a subset of the MNIST data set ($n = 1000$ images representing digits 4 and 9) using the different eigenvectors of the normalized Laplacian matrix $\mathcal{L}$. The eigenvector of index $k$ corresponds to the eigenvector associated with the $k$-th smallest eigenvalue of $\mathcal{L}$.

Let us also show that a higher-order eigenvector can lead to a better clustering in real data sets with geometric components. We select 1000 images from MNIST, representing digits 4 and 9, and construct a $k$-nearest neighbors ($k = 8$) similarity graph with Gaussian weights. The digits 4 and 9 form the hardest digit pair to distinguish. We plot in Figure 4.5 the accuracy obtained by spectral clustering as a function of the eigenvector order. We emphasise that, unlike for the political blogs data set, this is not an artifact due to dangling trees. We plot in Figure 4.6 the predicted clusters using the eigenvectors associated with the second and third smallest eigenvalues of the graph's normalized Laplacian, and compare them with the true clusters. We notice that the predicted clusters are of balanced sizes. We also note that the NCut of the true labels is 3.8, while the NCut of the predicted labels associated with the prediction using the second (resp. third) eigenvector is 2.7 (resp. 3.7). Therefore, for this graph, the correct labels do not correspond to the smallest normalized cut.

Figure 4.6. Different clusterings on the same graph as in Figure 4.5. The colors in Figure 4.6(a) show the true labels, while the colors in Figures 4.6(b) and 4.6(c) correspond to the predicted labels using respectively the eigenvectors associated with the second and third smallest eigenvalues of the normalized Laplacian. Panels: (a) True labels. (b) Labels using $v_2$. (c) Labels using $v_3$.


4.2 Modularity-based Methods 


In this section, we will first define a quality function, called modularity (first introduced by Newman and Girvan, 2004), that aims to compare the density of links of our cluster assignment with the one we would obtain if the graph were built from a random null-model. By optimising the modularity over the space of all partitions, we identify groups of nodes that are more densely connected to each other than one would expect by random chance. As maximisation of modularity is NP-hard, we describe two common methods to do it approximately.


4.2.1 Definition


Definition 4.2. Given a vector $z \in [n]^n$ such that $z_i$ denotes the community of node $i$, the modularity of $z$ is defined by

$$M(z) = \frac{1}{2|E|} \sum_{i,j} \left(A_{ij} - P_{ij}\right) 1(z_i = z_j), \tag{4.17}$$

where $|E|$ is the number of edges and $P_{ij} = \frac{d_i d_j}{2|E|}$.


Remark 4.1. A few remarks are in order:

• We let the community labelling $z$ take values in $[n]$, so that potentially we have $n$ communities (hence every node can be alone in its community). Also, some communities can be empty.

• Figure 4.7 shows the modularity of several partitions on a toy graph. In particular, we observe that in this toy graph, the "obvious" community structure corresponds to the largest modularity (Figure 4.7(a)), and variations from this partition lead to smaller modularity (Figure 4.7(b)). Moreover:

  - if $z = 1_n$ (i.e., the partition $z$ assigns all the nodes to a single group), then $M(z) = 0$ (Figure 4.7(c));
  - if the partition $z$ assigns every node to its own community (i.e., $z = (1, 2, \dots, n)$), then $M(z) < 0$ (Figure 4.7(d)).

  These two simple facts hold for any graph and are easy to establish.

• The factor $1/(2|E|)$ is a normalisation factor. In particular, showing that $-1 \le M(z) \le 1$ for any graph and any node labelling vector $z$ is straightforward.³

• $P_{ij}$ is the expected number of edges between $i$ and $j$ if the graph was drawn from a configuration model. Indeed, node $i$ has $d_i$ outgoing edges, and the probability that one of these edges goes to node $j$ is $d_j/(2|E|)$, where $|E|$ is the total number of edges in the network. In Section 4.4.1, we will further justify this choice by linking the modularity to the MAP estimator of a SBM.

• In practice, good values for modularity typically lie between 0.3 and 0.7. We refer to Table 4.4 for the modularity value of several networks with ground-truth communities.

• Unfortunately, optimising the modularity is NP-hard: the associated decision problem is NP-complete (Brandes et al., 2007).

3. In fact, with some additional work, one can show that $-\frac{1}{2} \le M(z) \le 1$ (Brandes et al., 2007).

Figure 4.7. Modularity $M$ defined in Equation (4.17) for several partitions of a network with two obvious communities. Panels: (a) Optimal partition: $M = 0.41$. (b) Sub-optimal partition: $M = 0.17$. (c) Single partition: $M = 0$. (d) Negative modularity: $M = -0.11$. The figure is inspired from Barabasi, 2016.
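Definition 4.2 translates directly into code. The sketch below is a literal, quadratic-time evaluation of (4.17) for an undirected graph stored as a list of lists; the function name and the toy graph (two triangles joined by one edge) are ours.

```python
def modularity(adj, labels):
    """Modularity (4.17) of the labelling `labels` for a symmetric adjacency matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)                       # 2|E| for an undirected graph
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += adj[i][j] - degree[i] * degree[j] / two_m   # A_ij - P_ij
    return total / two_m

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

m_obvious = modularity(adj, [0, 0, 0, 1, 1, 1])    # the two triangles
m_single = modularity(adj, [0] * 6)                # one community
m_alone = modularity(adj, list(range(6)))          # every node alone
```

On this graph the two-triangle partition has modularity 5/14 (about 0.357), the single-community labelling has modularity 0, and the all-singletons labelling has a negative modularity, in line with the remarks above.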


Efficient computation of modularity 


The following lemmas provide formulas to compute and to update the modularity, which will be useful for the algorithms presented in the next section. For a community labelling $z \in [n]^n$, we define the fraction of edges going from community $k$ to community $\ell$ as

$$e_{k\ell}(z) = \frac{1}{2|E|} \sum_{i,j} A_{ij}\, 1(z_i = k)\, 1(z_j = \ell),$$

and the mass $m_k(z)$ of a community $k$ as the sum of the degrees of the nodes in community $k$, normalized by the total sum of node degrees

$$m_k(z) = \frac{1}{2|E|} \sum_{i=1}^{n} d_i\, 1(z_i = k).$$

Lemma 4.3. The modularity of a community labelling is equal to

$$M(z) = \sum_{k=1}^{n} \left( e_{kk}(z) - \left(m_k(z)\right)^2 \right).$$

Proof. The proof is immediate, by writing

$$M(z) = \frac{1}{2|E|} \sum_{k=1}^{n} \sum_{i,j} \left(A_{ij} - P_{ij}\right) 1(z_i = k)\, 1(z_j = k)$$

and using the definitions of $e_{kk}(z)$ and $m_k(z)$.


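Lemma 4.4. Let $z^{old} \in [n]^n$ be a community labelling, and define $z^{new}$ as the labelling obtained by merging two communities $k_1$ and $k_2$:

$$z_i^{new} = \begin{cases} k_1 & \text{if } z_i^{old} = k_2, \\ z_i^{old} & \text{otherwise.} \end{cases}$$

The resulting change of modularity is equal to

$$M(z^{new}) - M(z^{old}) = 2 \left[ e_{k_1 k_2}(z^{old}) - m_{k_1}(z^{old})\, m_{k_2}(z^{old}) \right].$$

Proof. For any $k \notin \{k_1, k_2\}$, $e_{kk}(z^{new}) = e_{kk}(z^{old})$ and $m_k(z^{new}) = m_k(z^{old})$. Moreover, since $\{i : z_i^{new} = k_2\} = \emptyset$, we have $e_{k_2 k_2}(z^{new}) = 0$ and $m_{k_2}(z^{new}) = 0$. Therefore, using Lemma 4.3, the difference $M(z^{new}) - M(z^{old})$ is equal to

$$e_{k_1 k_1}(z^{new}) - \left(m_{k_1}(z^{new})\right)^2 - \sum_{k \in \{k_1, k_2\}} \left( e_{kk}(z^{old}) - \left(m_k(z^{old})\right)^2 \right).$$

Since $\{i : z_i^{new} = k_1\} = \{i : z_i^{old} = k_1\} \cup \{i : z_i^{old} = k_2\}$, we have

$$e_{k_1 k_1}(z^{new}) = e_{k_1 k_1}(z^{old}) + 2 e_{k_1 k_2}(z^{old}) + e_{k_2 k_2}(z^{old})$$

and

$$m_{k_1}(z^{new}) = m_{k_1}(z^{old}) + m_{k_2}(z^{old}),$$

which thus leads to the stated result.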
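Lemmas 4.3 and 4.4 can be checked numerically. The sketch below computes the block quantities $e_{k\ell}(z)$ and $m_k(z)$, evaluates the modularity through Lemma 4.3, and compares the merge formula of Lemma 4.4 with a direct recomputation. The helper names and the toy graph are ours, and the statistics are recomputed from scratch for readability; an efficient implementation would keep them up to date so that each merge gain costs constant time.

```python
from collections import defaultdict

def block_statistics(adj, labels):
    """Return (e, m) with e[(k, l)] = e_kl(z) and m[k] = m_k(z)."""
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    e, m = defaultdict(float), defaultdict(float)
    for i, row in enumerate(adj):
        m[labels[i]] += degree[i] / two_m
        for j, a_ij in enumerate(row):
            e[(labels[i], labels[j])] += a_ij / two_m
    return e, m

def modularity(adj, labels):
    """Modularity computed community by community, as in Lemma 4.3."""
    e, m = block_statistics(adj, labels)
    return sum(e[(k, k)] - m[k] ** 2 for k in list(m))

def merge_gain(adj, labels, k1, k2):
    """Change of modularity when communities k1 and k2 are merged (Lemma 4.4)."""
    e, m = block_statistics(adj, labels)
    return 2 * (e[(k1, k2)] - m[k1] * m[k2])

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

old = [0, 0, 0, 1, 1, 2]
new = [0, 0, 0, 1, 1, 1]                      # communities 1 and 2 merged
gain = merge_gain(adj, old, 1, 2)
direct = modularity(adj, new) - modularity(adj, old)
```

Both `gain` and `direct` are equal to 36/196 on this example.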


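Lemma 4.5. Let $z^{old}$ and $z^{new}$ be two community labellings that differ only for one node $i$. Let $z_i^{old} = a$ and $z_i^{new} = b$. Then the difference of modularity $M(z^{new}) - M(z^{old})$ is equal to

$$\sum_{k \in \{a, b\}} \left[ e_{kk}(z^{new}) - \left(m_k(z^{new})\right)^2 - e_{kk}(z^{old}) + \left(m_k(z^{old})\right)^2 \right].$$

Proof. Since the only modified communities are $a$ and $b$, for any $k \notin \{a, b\}$, we have $e_{kk}(z^{new}) = e_{kk}(z^{old})$ and $m_k(z^{new}) = m_k(z^{old})$. The result then holds by Lemma 4.3.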
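For a single node move, the difference in Lemma 4.5 can be made fully explicit. Writing $d_i(a)$ for the weight of the links from $i$ to the other nodes of its old community $a$, and $d_i(b)$ for the weight of the links from $i$ to community $b$ (notation of ours), and assuming a graph without self-loops and $b \neq a$, expanding the block quantities gives $\Delta M = \frac{d_i(b) - d_i(a)}{|E|} - \frac{d_i}{|E|} \left( m_b(z^{old}) - m_a(z^{old}) + \frac{d_i}{2|E|} \right)$. The sketch below implements this expression and compares it with a direct recomputation; all names are hypothetical.

```python
def modularity(adj, labels):
    """Direct evaluation of the modularity, used here only as a reference."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return sum(adj[i][j] - degree[i] * degree[j] / two_m
               for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m

def move_gain(adj, labels, i, b):
    """Change of modularity when node i moves from its community to community b != labels[i]."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    a = labels[i]
    d_ia = sum(adj[i][j] for j in range(n) if j != i and labels[j] == a)
    d_ib = sum(adj[i][j] for j in range(n) if labels[j] == b)
    m_a = sum(degree[j] for j in range(n) if labels[j] == a) / two_m
    m_b = sum(degree[j] for j in range(n) if labels[j] == b) / two_m
    delta = degree[i] / two_m                 # mass carried by node i
    return 2 * (d_ib - d_ia) / two_m - 2 * delta * (m_b - m_a + delta)

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

old = [0, 0, 0, 1, 1, 1]
new = [0, 0, 1, 1, 1, 1]                      # node 2 moved to community 1
gain = move_gain(adj, old, 2, 1)
direct = modularity(adj, new) - modularity(adj, old)
```

Here the sums are recomputed in $O(n)$ for clarity. If the community masses are stored and updated after each move, only the neighbours of $i$ need to be visited.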


4.2.2 Greedy Algorithm 


The first modularity maximisation algorithm, proposed by Newman, 2004, and reproduced here (Algorithm 6), iteratively joins pairs of communities, choosing at each step the merge that yields the largest change in the partition's modularity. Some extensions have been proposed (see for example Clauset et al., 2004), but those have been outperformed by the Louvain algorithm (Subsection 4.2.3).

Algorithm 6: Greedy algorithm for modularity maximisation.
Input: adjacency matrix $A$.
Output: node labelling $\hat z = (\hat z_1, \dots, \hat z_n)$.
Initialize: assign each node to a community of its own, starting with $n$ communities of single nodes (in other words, set $z_i = i$).
Update:
(i) For each community pair connected by at least one edge, compute the modularity difference $\Delta M$ obtained if we were to merge the two communities.
(ii) Identify the community pair for which $\Delta M$ is the largest and merge these two communities. (Modularity is always calculated for the full network, and $\Delta M$ can be negative.)

Repeat the Update step, recording $M$ at each step.
Stop when all nodes are merged into a single community.
Return: the partition $\hat z$ for which $M$ is maximal.
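The greedy procedure can be sketched as follows. This is a deliberately naive version that recomputes the modularity of every candidate merge from scratch, so it does not reach the complexity of Proposition 4.3; an efficient implementation would rely on the constant-time update of Lemma 4.4. The function names and the toy graph are ours.

```python
def modularity(adj, labels):
    """Direct evaluation of the modularity (4.17)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return sum(adj[i][j] - degree[i] * degree[j] / two_m
               for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m

def greedy_modularity(adj):
    """Merge communities greedily and return the best labelling seen along the way."""
    n = len(adj)
    labels = list(range(n))                               # z_i = i
    best_labels, best_m = labels[:], modularity(adj, labels)
    while len(set(labels)) > 1:
        # community pairs connected by at least one edge
        pairs = {(min(labels[i], labels[j]), max(labels[i], labels[j]))
                 for i in range(n) for j in range(n)
                 if adj[i][j] and labels[i] != labels[j]}
        if not pairs:                                     # disconnected graph: stop merging
            break
        candidates = []
        for a, b in sorted(pairs):
            merged = [a if lab == b else lab for lab in labels]
            candidates.append((modularity(adj, merged), merged))
        current_m, labels = max(candidates, key=lambda c: c[0])   # merge with largest change
        if current_m > best_m:
            best_labels, best_m = labels[:], current_m
    return best_labels, best_m

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

labels, best_m = greedy_modularity(adj)
```

On this toy graph the best recorded partition is the pair of triangles. The loop keeps merging afterwards, even though the last merge decreases the modularity, exactly as the algorithm prescribes.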


Proposition 4.3. The time complexity of Algorithm 6 is $O(n(|E| + n))$.

Proof. By Lemma 4.4, the computation of $\Delta M$ is done in constant time. At the initial update step, we have $|E|$ such computations to do (and at each later update step we have fewer than $|E|$, since communities have been merged). Then, after identifying the maximal $\Delta M$ (which is done during the computation of all the $\Delta M$), we need to recompute the adjacency matrix. This can take up to $O(n)$ operations. Finally, we need to do the update step $n - 1$ times. Hence, the overall time complexity is of the order of $n - 1$ times $|E| + n$.


4.2.3 Louvain Algorithm 


Algorithm 7 presents the Louvain algorithm invented by Blondel et al., 2008. This 
method is called Louvain because the authors of the original paper were at that time 
based in Louvain University in Belgium. 


Remark 4.2. Algorithm 7 requires computing the change of modularity when one node is moved from one community to another. This can be done in constant time, as Lemma 4.5 shows.

Remark 4.3. The most time-consuming pass of Algorithm 7 is the first pass, where we have $|E|$ changes of modularity to compute. The ensuing passes are faster, as they deal with much smaller graphs. Thus, a simple complexity estimate is $O(|E|)$, which is much better than the complexity of the greedy algorithm.

Algorithm 7: Louvain algorithm for fast modularity maximisation (Blondel et al., 2008).
Input: adjacency matrix $A$.
Output: node labelling $\hat z = (\hat z_1, \dots, \hat z_n)$.
Step I:
• assign each node to a community of its own, starting with $n$ communities of single nodes (in other words, set $z_i = i$);
• for each node $i$, evaluate the gain in modularity if we place node $i$ in the community of one of its neighbors $j$;
• move node $i$ to the community for which the modularity gain is the largest, but only if this gain is positive. If no positive gain is found, $i$ stays in its original community;
• apply this process to all nodes until no further improvement can be achieved. In particular, a node can be moved several times.

Step II: construct a network whose nodes are the communities identified in Step I, and where:
• the weight between two communities is the sum of the weights of the links between the nodes in the corresponding communities;
• the links between nodes of the same community lead to weighted self-loops.

Step II being completed, repeat Step I and then Step II (we call it a pass). Each pass decreases the number of communities. The passes are repeated until there are no more changes and a local maximum of the modularity is attained.
Return: $\hat z$.
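Step I of the Louvain algorithm (the local moving phase) can be sketched as follows. This is only a sketch of Step I: the aggregation of Step II and the repetition of passes are omitted. It assumes a graph without self-loops, stored as a dense symmetric matrix, and uses the fact (a consequence of Lemma 4.4) that inserting an isolated node $i$ into a community $c$ changes the modularity by $2 \left[ \frac{d_i(c)}{2|E|} - \frac{d_i\, \mathrm{vol}(c)}{(2|E|)^2} \right]$, where $d_i(c)$ is the weight of the links between $i$ and $c$. The function name and the toy graph are ours.

```python
def louvain_local_moving(adj):
    """Step I of the Louvain algorithm: move nodes one by one while modularity increases."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    labels = list(range(n))                   # z_i = i
    vol = degree[:]                           # vol[c]: sum of the degrees in community c
    improved = True
    while improved:
        improved = False
        for i in range(n):
            a = labels[i]
            links = {}                        # links[c]: weight between i and community c
            for j in range(n):
                if j != i and adj[i][j]:
                    links[labels[j]] = links.get(labels[j], 0) + adj[i][j]
            vol[a] -= degree[i]               # temporarily take i out of its community

            def gain(c):
                # modularity gain of inserting the isolated node i into community c
                return 2 * (links.get(c, 0) / two_m - degree[i] * vol[c] / two_m ** 2)

            best_c, best_gain = a, gain(a)    # staying put is the reference
            for c in links:
                g = gain(c)
                if g > best_gain + 1e-12:     # move only for a strictly positive improvement
                    best_c, best_gain = c, g
            vol[best_c] += degree[i]
            if best_c != a:
                labels[i] = best_c
                improved = True
    return labels

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

labels = louvain_local_moving(adj)
```

Because a node moves only when the modularity strictly increases, the loop terminates. The result generally depends on the order in which the nodes are visited.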


4.2.4 Discussion 


Unlike spectral methods, modularity-based methods do not require a priori knowledge of the number of blocks. Moreover, the speed difference observed in practice puts the greedy method (Algorithm 6) out of the competition. Furthermore, it has also been empirically observed that Louvain returns partitions with high modularity. Table 4.4 gives the performance of the Louvain method on real data sets. In particular, we see that Louvain tends to predict a high number of communities, but with a modularity higher than that of the ground truth partition.

We plot in Figures 4.8 and 4.9 the communities predicted by Louvain on the karate club and the political blogs data sets, respectively. By comparing to the ground truth, we observe that Louvain splits the ground truth communities into smaller communities. This results in configurations with larger modularity than the ground truth ones (see Table 4.4), and the ground truth could be recovered almost perfectly by merging the small communities predicted by Louvain into larger ones.

Table 4.4. Performance of the Louvain algorithm on real data sets. $K$ and $M$ refer to the number of clusters and the modularity of the ground truth partition, while $\hat K$ and $\hat M$ refer to the number of clusters and the modularity predicted by the Louvain algorithm.

                    Structure            Ground Truth    Louvain
Data Set            n         |E|        K      M        K̂      M̂
karate club         34        78         2      0.36     4      0.42
dolphins            62        159        2      0.37     5      0.52
political blogs     1222      16717      2      0.41     11     0.43
DBLP top2           13326     34281      2      0.44     76     0.86
LiveJournal top2    2766      24138      2      0.38     21     0.59
cora                2485      5069       7      0.63     24     0.80
citeseer            2110      3668       6      0.52     37     0.85
wikivitals          10012     629527     11     0.31     8      0.44


Figure 4.8. Comparison of the ground-truth communities and the communities predicted by the Louvain algorithm on the karate club data set. Panels: (a) True labels. (b) Labels predicted by Louvain.

Figure 4.9. Comparison of the ground-truth communities and the communities predicted by the Louvain algorithm on the political blogs data set. Panels: (a) True labels. (b) Labels predicted by Louvain.

Figure 4.10. Boxplot of the modularity and number of clusters obtained by the Louvain algorithm on different random graphs without community structure. Panels: (a) Modularity. (b) Number of blocks. We computed 100 random graphs with $n = 1500$ nodes. ER refers to the Erdős-Rényi model with edge probability $p$, CM to a configuration model with a Zipf law of parameter 2 for the degree distribution, and PA is the simple preferential attachment model as described in Section 2.2.2.


4.3 Bayesian Community Detection 


4.3.1 An Over-fitting Issue?


Interpreting the result of any modularity maximisation algorithm should be done carefully. Indeed, partitions with high modularity can be found in random graph models without any community structure. Figure 4.10 shows the output (both the modularity of the predicted partition and the predicted number of clusters) of the Louvain algorithm on Erdős-Rényi, configuration model and preferential attachment random graphs. The modularity found is high, especially in Erdős-Rényi and preferential attachment random graphs, although those graphs have by construction no community structure! Furthermore, partitions with high modularity are also found in the configuration model, which is supposed to be the modularity's null-model. We emphasise that this is an intrinsic problem of modularity maximisation, and not a side-effect of the Louvain algorithm.

Cut-based methods are also prone to overfit. Figure 4.11 shows that using nor- 
malized spectral clustering on Erdős-Rényi random graphs leads to partitions whose 
cut represents between 15% and 30% of the total number of edges (depending on 
the number of clusters chosen). In other words, spectral clustering finds commu- 
nities in the middle of pure randomness! 


4.3.2 Principled Approach 


To avoid the over-fitting issues and to find statistically significant communities in networks, we now explore a Bayesian approach. Bayesian community detection aims at determining which community labelling $z \in [n]^n$ is responsible for the network $A$ by maximising the posterior distribution $P(z \mid A)$. Bayes' law gives

$$P(z \mid A) = \frac{P(A \mid z)\, P(z)}{P(A)}.$$

The quantity $P(A)$ in the denominator is the evidence, that is the probability of the observed data, and does not depend on $z$.

The quantity $P(A \mid z)$ is the marginal likelihood. We will make the assumption that the network was generated according to the Poisson version of the homogeneous DC-SBM (see Section 2.3.2). Therefore, $P(A \mid z)$ is equal to

Figure 4.11. Boxplot of the proportion of out-going edges obtained by normalized spectral clustering for various $K$ on Erdős-Rényi random graphs with $p = 0.01$.


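$$\int P(A \mid z, \omega, \theta)\, P(\omega, \theta \mid z)\, d\omega\, d\theta. \tag{4.18}$$

In particular, $P(A \mid z, \omega, \theta)$ is equal to⁴ (see Proposition 2.8)

$$\prod_{1 \le k \le K} \omega_{kk}^{m_{kk}}\, e^{-n_k^2 \omega_{kk}/2} \prod_{1 \le k < \ell \le K} \omega_{k\ell}^{m_{k\ell}}\, e^{-n_k n_\ell \omega_{k\ell}} \prod_{i} \theta_i^{d_i},$$

where $m_{k\ell} = \sum_{i<j} A_{ij}\, 1(z_i = k)\, 1(z_j = \ell)$. We select a uniform prior for $\theta$ that imposes the normalisation condition $\sum_i \theta_i 1(z_i = k) = n_k$ for all $k$. Hence

$$P(\theta \mid z) = \prod_k \frac{(n_k - 1)!}{n_k^{\,n_k - 1}}\, \delta\Big( \sum_i \theta_i 1(z_i = k) - n_k \Big).$$

Finally, we recall that for a continuous random variable $X \in [0, \infty)$ with constrained average $\bar x$, the maximum entropy distribution is the exponential distribution whose density is $f(x) = e^{-x/\bar x}/\bar x$. Thus, we choose an exponential prior for $\omega_{k\ell}$, such that

$$P(\omega_{k\ell} \mid z) = \frac{e^{-\omega_{k\ell}/\bar\omega}}{\bar\omega},$$

where $\bar\omega = 2|E|/n^2$ corresponds to the average edge probability in the network. Performing the integral over $\omega$ in Equation (4.18) leads to

$$\prod_k \frac{\bar\omega^{\,m_{kk}}\, m_{kk}!}{\left(1 + \bar\omega n_k^2/2\right)^{m_{kk}+1}} \prod_{k<\ell} \frac{\bar\omega^{\,m_{k\ell}}\, m_{k\ell}!}{\left(1 + \bar\omega n_k n_\ell\right)^{m_{k\ell}+1}} \int \prod_i \theta_i^{d_i}\, P(\theta \mid z)\, d\theta,$$

where we used $\int_0^\infty x^a e^{-bx}\, dx = \frac{a!}{b^{a+1}}$ for $a, b > 0$. To carry out the last integral over $\theta$, we notice that for all $k$,

$$\int \prod_{i \in C_k} \theta_i^{d_i}\, \delta\Big( \sum_{i \in C_k} \theta_i - n_k \Big) \prod_{i \in C_k} d\theta_i = n_k^{\,\sum_{i \in C_k} d_i + n_k - 1}\, \frac{\prod_{i \in C_k} d_i!}{\left( \sum_{i \in C_k} d_i + n_k - 1 \right)!},$$

where $C_k = \{i : z_i = k\}$. Therefore $P(A \mid z)$ equals

$$\prod_k \frac{\bar\omega^{\,m_{kk}}\, m_{kk}!}{\left(1 + \bar\omega n_k^2/2\right)^{m_{kk}+1}} \prod_{k<\ell} \frac{\bar\omega^{\,m_{k\ell}}\, m_{k\ell}!}{\left(1 + \bar\omega n_k n_\ell\right)^{m_{k\ell}+1}} \prod_k \frac{n_k^{\,v_k}\, (n_k - 1)!}{(n_k + v_k - 1)!} \cdot \frac{\prod_i d_i!}{\prod_{i<j} A_{ij}!},$$

where $v_k = \sum_i d_i 1(z_i = k)$ is the sum of the degrees of the nodes in block $k$.

4. Up to a term $\prod_{i<j} A_{ij}!$ that does not depend on $z$.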

Let us now study the prior distribution $P(z)$. In particular, the choice of prior should not make any a priori assumption on the number of (non-empty) groups or on the number of nodes in each group (allowing groups of different sizes). Let

$$P(z) = P(z \mid \{n_k\})\, P(\{n_k\} \mid K)\, P(K),$$

where $K$ denotes the number of non-empty groups in $z$, and $n_k$ denotes the number of nodes in community $k$. We firstly have $P(K) = 1/n$ (the prior is agnostic about the number of blocks). Then, recalling that $\binom{n-1}{K-1}$ counts the number of ways to divide $n$ nonzero counts into $K$ nonempty bins, the probability that the $K$ blocks have sizes $n_1, \dots, n_K$ is $P(\{n_k\} \mid K) = \binom{n-1}{K-1}^{-1}$. Finally, given the randomly sampled block sizes $\{n_k\}$, the partition is sampled with a uniform probability $P(z \mid \{n_k\}) = \frac{\prod_k n_k!}{n!}$. Therefore,

$$P(z) = \frac{1}{n} \binom{n-1}{K-1}^{-1} \frac{\prod_k n_k!}{n!}.$$

Using the expressions of the marginal likelihood and the prior leads to the maximisation of

$$\binom{n-1}{K-1}^{-1} \prod_k \frac{n_k!\, n_k^{\,v_k}\, (n_k - 1)!}{(n_k + v_k - 1)!} \cdot \frac{m_{kk}!}{\left(1 + \bar\omega n_k^2/2\right)^{m_{kk}+1}} \prod_{k<\ell} \frac{m_{k\ell}!}{\left(1 + \bar\omega n_k n_\ell\right)^{m_{k\ell}+1}}$$

over all possible community labellings $z \in [n]^n$.


4.3.3 Markov Chain Monte Carlo Algorithm 


While the above likelihood-based maximisation problem is hard, we can employ a Markov Chain Monte Carlo (MCMC) importance sampling approach to find a good approximate solution (Robert and Casella, 2013). We start from some initial labelling $z^{(0)}$. At each step $t$, we propose a modification $z'$ of the labelling $z^{(t)}$. This modification is accepted with probability

$$\min\left\{ 1,\ \frac{P(z' \mid A)\, P(z^{(t)} \mid z')}{P(z^{(t)} \mid A)\, P(z' \mid z^{(t)})} \right\},$$

where $P(z' \mid z^{(t)})$ denotes the probability of proposing $z'$ when the current labelling is $z^{(t)}$. If the move is accepted, then $z^{(t+1)} = z'$, otherwise $z^{(t+1)} = z^{(t)}$. This acceptance probability is known as the Metropolis-Hastings criterion, and enforces the detailed balance (Metropolis et al., 1953; Hastings, 1970a). Computing the ratio $P(z' \mid A)/P(z^{(t)} \mid A)$ has $O(d_i)$ time-complexity, where $i$ is the node whose label is modified, using the previous computations (in particular, we do not need to compute the evidence $P(A)$ as it cancels out).
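The accept/reject mechanism can be sketched generically. In the sketch below, `log_post` stands for any function returning $\log P(A \mid z) + \log P(z)$ up to an additive constant, and `propose` for any proposal returning the new labelling together with the log-probabilities of the forward and backward moves; both are placeholders of ours, as is the toy target used in the example. Working with logarithms avoids numerical underflow, and the evidence $P(A)$ never appears since it cancels in the ratio.

```python
import math
import random

def acceptance_probability(log_post_new, log_post_old, log_q_forward=0.0, log_q_backward=0.0):
    """Metropolis-Hastings acceptance probability, computed from log-quantities."""
    log_ratio = (log_post_new - log_post_old) + (log_q_backward - log_q_forward)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def metropolis_hastings(log_post, propose, z0, n_steps, rng):
    """Run the chain and return the labelling with the largest posterior encountered."""
    z, lp = list(z0), log_post(z0)
    best, best_lp = list(z), lp
    for _ in range(n_steps):
        z_new, log_q_fwd, log_q_bwd = propose(z, rng)
        lp_new = log_post(z_new)
        if rng.random() < acceptance_probability(lp_new, lp, log_q_fwd, log_q_bwd):
            z, lp = z_new, lp_new             # move accepted: z^(t+1) = z'
            if lp > best_lp:
                best, best_lp = list(z), lp
        # otherwise the chain stays where it is: z^(t+1) = z^(t)
    return best, best_lp

# Toy target (purely illustrative): labellings with fewer nodes labelled 1 are more likely.
def toy_log_post(z):
    return -float(sum(z))

def flip_one(z, rng):
    """Symmetric proposal: flip the binary label of one node chosen uniformly at random."""
    i = rng.randrange(len(z))
    z_new = list(z)
    z_new[i] = 1 - z_new[i]
    return z_new, 0.0, 0.0                    # forward and backward probabilities are equal

z_start = [1, 1, 1, 1]
best, best_lp = metropolis_hastings(toy_log_post, flip_one, z_start, 200, random.Random(0))
```

For the community detection problem, `log_post` would be the logarithm of the expression maximised in Section 4.3.2, and an efficient implementation would evaluate only the difference of log-posteriors induced by a single node move.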

Figure 4.12. Marginal posterior probability of the number of groups (Figure 4.12(a)) for
the karate-club network, under the assumption that the network is a realization of a
degree-corrected SBM. Figures 4.12(b) and 4.12(c) show examples of the partitions obtained.

The simplest move proposal consists in selecting a node $i$ uniformly at random and
choosing its new community membership $z_i$ among the $K + 1$ choices (the $K$ existing
groups plus the possibility to assign $i$ to an empty group). This direct approach is
inefficient, as the mixing time of the Markov chain might be enormous. A better
approach (Peixoto, 2014a, 2019) consists in choosing the new group membership $z_i$
according to

$$P(z_i = \ell \mid z) = \sum_{k} P(k \mid i) \, \frac{m_{k\ell} + \epsilon}{m_k + \epsilon (K + 1)},$$

where $P(k \mid i) = \frac{1}{d_i} \sum_j A_{ij} 1(z_j = k)$ is the fraction of neighbors of $i$ belonging to group
$k$, $m_{k\ell}$ is the number of edges between groups $k$ and $\ell$, $m_k = \sum_\ell m_{k\ell}$, and $\epsilon > 0$ is a parameter enforcing ergodicity. We can interpret this probability
as firstly choosing a node $i$ uniformly at random, and sampling a neighbor $j$ of $i$,
whose community label is $z_j = k$. Then,

(i) with probability $\frac{\epsilon (K+1)}{m_k + \epsilon (K+1)}$ we choose a community label $\ell$ at random among the $K + 1$ possibilities (it can be an empty group);

(ii) otherwise, we sample a group label $\ell$ with probability $\frac{m_{k\ell}}{m_k}$.

This procedure can be performed in $O(d_i)$ time-complexity, provided that we
keep track of the edges incident to each group, which incurs $O(|E|)$ memory-complexity.
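The neighbor-guided proposal can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation of Peixoto: `neighbors` is a hypothetical adjacency list, `m` is a hypothetical $K \times K$ table of edge counts between groups with row sums $m_k$, node $i$ is assumed to have at least one neighbor, and the label `K` stands for a new, empty group.

```python
import random

def propose_label(i, z, neighbors, m, K, eps, rng):
    """Sample a new label for node i from the neighbor-guided proposal."""
    j = rng.choice(neighbors[i])                # random neighbor of i
    k = z[j]                                    # group label of that neighbor
    m_k = sum(m[k])                             # edges incident to group k
    # (i) uniform choice among the K existing labels and one empty group
    if rng.random() < eps * (K + 1) / (m_k + eps * (K + 1)):
        return rng.randrange(K + 1)
    # (ii) otherwise choose label l with probability m[k][l] / m_k
    return rng.choices(range(K), weights=m[k])[0]

def proposal_probability(i, l, z, neighbors, m, K, eps):
    """Probability that node i is proposed label l (closed form of the two-step scheme)."""
    d_i = len(neighbors[i])
    total = 0.0
    for j in neighbors[i]:
        k = z[j]
        m_kl = m[k][l] if l < K else 0          # no edges towards the empty group
        total += (m_kl + eps) / (sum(m[k]) + eps * (K + 1)) / d_i
    return total
```

The second function makes explicit that the two-step sampling scheme and the closed-form expression define the same distribution, whose probabilities sum to one over the $K + 1$ candidate labels.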


4.3.4 Numerical Results 


The MCMC algorithm described in this section is implemented in the graph-tool 
library (Peixoto, 2014b), available at http://graph-tool.skewed.de. 

We first analyze the performance of Bayesian clustering on synthetic networks. 
We generate DC-SBM graphs. 

The MCMC procedure for the Bayesian framework gives access to the posterior
distribution, instead of just finding its maximum. In particular, we can obtain
the marginal probabilities of the group memberships of the nodes, as well as the
marginal probability of the number of groups. Figure 4.12 shows the results
obtained on the karate-club network. In Figure 4.12(a), we observe a
large probability for the network to have one or two communities, while
configurations with larger numbers of communities are much less likely. Recalling that the
ground-truth corresponds to the situation after the feud between the main instructor
and the club's president, we can interpret the large posterior probability of the
one-community case as the network before the feud, when no communities were present.
When Bayesian clustering predicts two communities, we observe different
configurations. Some predictions do align with the two communities observed
after the feud (see Figure 4.12(b)), while other configurations tend to group the
large-degree nodes ('influencers') together, with the low-degree nodes ('followers')
forming the second community (see Figure 4.12(c)).

Figure 4.13. Boxplot of the number of clusters obtained by the Bayesian framework on
different random graphs without community structure (horizontal axis: ER, CM, PA;
vertical axis: number of communities). The setting is the same as in
Figure 4.10. We computed 100 random graphs with $n = 1500$ nodes. ER refers to the
Erdős-Rényi model, CM to a configuration model with a Zipf law of parameter 2 for the
degree distribution, and PA to the simple preferential attachment model described in
Section 2.2.2.


To finish this section, we show in Figure 4.13 that Bayesian clustering applied
to random graph models with no community structure predicts in most situations
only one community: the over-fitting issue is now absent.


4.4 Theoretical Analysis 


4.4.1 Modularity and Maximum A Posteriori Estimator


In this section, we consider the adjacency matrix $A$ of a random graph $G$ sampled
from a homogeneous degree-corrected block model, where the edges are Poisson
distributed (see Section 2.3.2). More precisely, $A_{ii} = 0$ and for $i \neq j$

$$A_{ij} \sim \begin{cases} \mathcal{P}(\theta_i \theta_j \omega_{in}), & \text{if } z_i = z_j, \\ \mathcal{P}(\theta_i \theta_j \omega_{out}), & \text{otherwise,} \end{cases} \qquad (4.19)$$

where $\mathcal{P}(\alpha)$ denotes a Poisson random variable of parameter $\alpha$ and $d_i$ is the
degree of node $i$. Similarly to the DC-SBM, we assume that for all $k \in [K]$,
$\sum_i \theta_i 1(z_i = k) = 1$. Proposition 4.4 shows that for such a block model, the
Maximum A Posteriori (MAP) estimator defined by

$$\hat{z}^{MAP} = \arg\max_{z \in [K]^n} P(z \mid A) \qquad (4.20)$$

corresponds to maximising a quantity resembling the modularity.


Proposition 4.4. Let A be the adjacency matrix of a block model graph with K blocks, 
n nodes, with uniform prior probability for the node labels and where the edges are 




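sampled independently according to (4.19). Then, the MAP estimator defined in (4.20)
verifies

$$\hat{z}^{MAP} = \arg\max_{z \in [K]^n} \sum_{i,j} \left( A_{ij} - \frac{\omega_{in} - \omega_{out}}{\log \omega_{in} - \log \omega_{out}} \theta_i \theta_j \right) 1(z_i = z_j).$$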


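Proof. The Bayes formula gives

$$P(z \mid A) \propto P(A \mid z) P(z),$$

where the proportionality hides the term $P(A)$ independent of $z$. Moreover,

$$P(z) = \prod_{i=1}^{n} P(z_i) = K^{-n},$$

and hence $P(z)$ is also independent of $z$. Therefore,

$$\arg\max_{z \in [K]^n} P(z \mid A) = \arg\max_{z \in [K]^n} P(A \mid z),$$

and $P(A \mid z) = \prod_{i<j} \frac{(\theta_i \theta_j \omega_{ij})^{A_{ij}}}{A_{ij}!} e^{-\theta_i \theta_j \omega_{ij}}$, where

$$\omega_{ij} = \begin{cases} \omega_{in}, & \text{if } z_i = z_j, \\ \omega_{out}, & \text{otherwise.} \end{cases}$$

Thus,

$$\log P(A \mid z) = \sum_{i<j} \left( A_{ij} \log(\theta_i \theta_j \omega_{ij}) - \theta_i \theta_j \omega_{ij} \right) - \sum_{i<j} \log(A_{ij}!) = \frac{1}{2} \sum_{i \neq j} \left( A_{ij} \log(\theta_i \theta_j \omega_{ij}) - \theta_i \theta_j \omega_{ij} \right) - \sum_{i<j} \log(A_{ij}!).$$

The last term $\sum_{i<j} \log(A_{ij}!)$ is independent of the labelling $z$ and does not
affect the position of the maximum. Moreover, we note that

$$\omega_{ij} = (\omega_{in} - \omega_{out}) 1(z_i = z_j) + \omega_{out}.$$

(To show this, simply notice that the right hand side equals $(\omega_{in} - \omega_{out}) \times 0 + \omega_{out} = \omega_{out}$ when $z_i \neq z_j$, and equals $(\omega_{in} - \omega_{out}) \times 1 + \omega_{out} = \omega_{in}$ when
$z_i = z_j$, hence it corresponds to the definition of $\omega_{ij}$.) Similarly,

$$\log(\theta_i \theta_j \omega_{ij}) = \left( \log(\theta_i \theta_j \omega_{in}) - \log(\theta_i \theta_j \omega_{out}) \right) 1(z_i = z_j) + \log(\theta_i \theta_j \omega_{out}) = \log\frac{\omega_{in}}{\omega_{out}} \, 1(z_i = z_j) + \log(\theta_i \theta_j \omega_{out}).$$

Therefore,

$$\log P(A \mid z) = \frac{1}{2} \sum_{i \neq j} \left( A_{ij} \log\frac{\omega_{in}}{\omega_{out}} - (\omega_{in} - \omega_{out}) \theta_i \theta_j \right) 1(z_i = z_j) + C,$$

where $C = \frac{1}{2} \sum_{i \neq j} \left( A_{ij} \log(\theta_i \theta_j \omega_{out}) - \theta_i \theta_j \omega_{out} \right) - \sum_{i<j} \log(A_{ij}!)$ is a constant
term independent of $z$. Hence, we obtain

$$\log P(A \mid z) = \frac{1}{2} \log\frac{\omega_{in}}{\omega_{out}} \sum_{i \neq j} \left( A_{ij} - \frac{\omega_{in} - \omega_{out}}{\log \frac{\omega_{in}}{\omega_{out}}} \theta_i \theta_j \right) 1(z_i = z_j) + C.$$

In the assortative case $\omega_{in} > \omega_{out}$ the prefactor $\log\frac{\omega_{in}}{\omega_{out}}$ is positive, and the diagonal terms $i = j$ do not depend on $z$, so maximising $\log P(A \mid z)$ is equivalent to maximising the sum in the statement,
and this ends the proof.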
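The identity obtained in the proof can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: `log_likelihood` and `map_objective` are hypothetical helper names, the weights `theta` are arbitrary positive numbers (the identity does not rely on their normalisation), and the sum runs over $i \neq j$ only, since the diagonal terms do not depend on $z$.

```python
import math

def log_likelihood(A, z, theta, w_in, w_out):
    """Poisson log-likelihood log P(A | z) of the model (4.19)."""
    n, total = len(A), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            rate = theta[i] * theta[j] * (w_in if z[i] == z[j] else w_out)
            total += A[i][j] * math.log(rate) - rate - math.lgamma(A[i][j] + 1)
    return total

def map_objective(A, z, theta, w_in, w_out):
    """Modularity-like sum over pairs i != j sharing the same label."""
    coef = (w_in - w_out) / (math.log(w_in) - math.log(w_out))
    n = len(A)
    return sum(A[i][j] - coef * theta[i] * theta[j]
               for i in range(n) for j in range(n)
               if i != j and z[i] == z[j])
```

For any fixed `A`, `theta`, `w_in` and `w_out`, the quantity `log_likelihood(...) - 0.5 * math.log(w_in / w_out) * map_objective(...)` should take the same value for every labelling `z`, which is the constant $C$ of the proof.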


Recall that the modularity was defined by equation (4.17) as

$$M(z) = \frac{1}{2|E|} \sum_{i,j} (A_{ij} - P_{ij}) 1(z_i = z_j),$$

where $P_{ij}$ is the probability of an edge between $i$ and $j$ under a null-model, and the
chosen null-model was the configuration model. Proposition 4.4 gives something
similar. Indeed, we can write

$$\hat{z}^{MAP} = \arg\max_{z \in [K]^n} \sum_{i,j} (A_{ij} - \gamma P_{ij}) 1(z_i = z_j),$$

where $\gamma = \frac{\omega_{in} - \omega_{out}}{\log \omega_{in} - \log \omega_{out}} \cdot \frac{K}{\omega_{in} + (K-1)\omega_{out}}$ and $P_{ij} = \theta_i \theta_j \frac{\omega_{in} + (K-1)\omega_{out}}{K}$ corresponds to the
expected probability of observing an edge between $i$ and $j$ under the null-model
(see Definition (4.19)). Moreover, the expected degree of a node $i$ is equal to $d_i = \sum_j \theta_i \theta_j \omega_{ij} = \theta_i (\omega_{in} + (K-1)\omega_{out})$, while the expected number of edges is equal to
$m = \frac{1}{2} \sum_i d_i = \frac{K}{2} (\omega_{in} + (K-1)\omega_{out})$. Hence, $P_{ij} = \frac{d_i d_j}{2m}$, and we recover

$$\hat{z}^{MAP} = \arg\max_{z \in [K]^n} \sum_{i,j} \left( A_{ij} - \gamma \frac{d_i d_j}{2m} \right) 1(z_i = z_j).$$

The quantity inside the arg max resembles the modularity as defined in (4.17),
with an extra parameter $\gamma$. One can define the regularised (or generalised) modularity (Reichardt
and Bornholdt, 2006; Arenas et al., 2008) as follows:

$$M_\gamma(z) = \sum_{i,j} (A_{ij} - \gamma P_{ij}) 1(z_i = z_j). \qquad (4.21)$$

Hence, the MAP estimator is equivalent to the maximisation of the generalised
modularity, with $P_{ij} = \frac{d_i d_j}{2m}$ and $\gamma$ as defined earlier. Unfortunately, the relevance of
this equivalence is rather limited, since the parameters $\omega_{in}$, $\omega_{out}$ (and hence $\gamma$) are
usually unknown, although various strategies have been proposed for their estimation
(see Newman, 2016, for more on this).
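As an illustration, the resolution parameter $\gamma$ and the generalised modularity (4.21) can be computed as follows. This is a sketch under assumptions: the function names are hypothetical, and the observed degrees of `A` are used in place of the expected degrees $d_i$ when forming $P_{ij} = d_i d_j / (2m)$.

```python
import math

def resolution_parameter(w_in, w_out, K):
    """gamma for which generalised modularity maximisation matches the MAP estimator."""
    log_ratio = math.log(w_in) - math.log(w_out)
    return K * (w_in - w_out) / (log_ratio * (w_in + (K - 1) * w_out))

def regularised_modularity(A, z, gamma):
    """Sum of (A_ij - gamma * d_i d_j / 2m) over pairs (i, j) with equal labels."""
    d = [sum(row) for row in A]                 # observed degrees (an assumption)
    two_m = sum(d)                              # twice the number of edges
    n = len(A)
    return sum(A[i][j] - gamma * d[i] * d[j] / two_m
               for i in range(n) for j in range(n) if z[i] == z[j])
```

With `gamma = 1` the second function reduces to the usual (unnormalised) modularity, so it vanishes when all nodes are placed in a single community.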


4.4.2 Normalized Spectral Clustering as a Continuous 
Relaxation of Modularity Maximisation 


The goal of this subsection is to link a particular relaxation of modularity
maximisation to normalized spectral clustering. For simplicity of derivations, we will
restrict the consideration to the case of two clusters. Recall that the maximisation
of the generalised modularity is given by

$$\hat{z} = \arg\max_{z \in \{-1,1\}^n} \sum_{i,j} \left( A_{ij} - \gamma \frac{d_i d_j}{2|E|} \right) 1(z_i = z_j).$$

Noticing that $1(z_i = z_j) = \frac{1}{2}(z_i z_j + 1)$, we can rewrite it as

$$\hat{z} = \arg\max_{z \in \{-1,1\}^n} \sum_{i,j} B_{ij} z_i z_j,$$

where $B$ is the matrix whose entries are $B_{ij} = A_{ij} - \gamma \frac{d_i d_j}{2|E|}$. As done for the
cut-based methods (Section 4.1), we can simplify the problem by relaxing the
discreteness of $z \in \{-1,1\}^n$ to a real-valued vector $x \in \mathbb{R}^n$ (Newman, 2013). However,
a constraint should be added to prevent $x_i$ from becoming arbitrarily large, i.e., to
prevent the term $\left( A_{ij} - \gamma \frac{d_i d_j}{2|E|} \right) x_i x_j$ from becoming large in a trivial way. A
straightforward constraint consists in fixing $x$ onto the hyper-sphere by imposing $\sum_i x_i^2 = n$.
This in particular sets the limit $-\sqrt{n} \le x_i \le \sqrt{n}$, while imposing the squared $\ell^2$-norm
of $x$ to be equal to $n$. More generally, we can fix $x$ to a hyper-ellipsoid by letting
$\sum_i \kappa_i x_i^2 = \sum_i \kappa_i$ for a vector $\kappa = (\kappa_1, \dots, \kappa_n)$ of non-negative entries. In
particular, for $\kappa_i = d_i$ this leads to the following problem

$$\hat{x} = \arg\max_{x \in \mathbb{R}^n, \; x^T D x = 2|E|} x^T B x,$$

where we used $x^T B x = \sum_{i,j} B_{ij} x_i x_j$.
The Lagrangian associated to the above problem is

$$x^T B x - \lambda \left( x^T D x - 2|E| \right),$$

and equating the derivative with respect to $x$ to zero gives

$$B x = \lambda D x. \qquad (4.22)$$

Thus $x$ is a solution of a generalized eigenvector equation corresponding to an
eigenvalue $\lambda$. To know which value of $\lambda$ we should consider, we note that for a
generalized eigenvector $x$, we have $x^T B x = \lambda x^T D x = 2 \lambda |E|$, and hence the
modularity $x^T B x$ is highest for the largest eigenvalue $\lambda$ of the generalized
eigenproblem (4.22).

Since $B 1_n = (1 - \gamma) D 1_n$, $\lambda = 1 - \gamma$ is an admissible solution of (4.22).
Therefore, if the maximum eigenvalue is $1 - \gamma$, then the best partitioning
corresponds to not dividing the network at all. We rule out this case, and thus assume
$\lambda > 1 - \gamma$. Noticing that $B x = A x - \gamma D 1_n \frac{d^T x}{2|E|}$ with $d = (d_1, \dots, d_n)$, we rewrite
problem (4.22) as

$$A x = D \left( \lambda x + \gamma 1_n \frac{d^T x}{2|E|} \right).$$

Multiplying on the left by $1_n^T$ leads to $d^T x = (\lambda + \gamma) d^T x$ (we used $1_n^T A = 1_n^T D = d^T$ and $d^T 1_n = 2|E|$). Since $\lambda > 1 - \gamma$, this in turn implies $d^T x = 0$,
and problem (4.22) simplifies to

$$A x = \lambda D x.$$

We notice that the constant vector $x = 1_n$ is a solution, and according to the
Perron-Frobenius theorem it is associated to the largest eigenvalue since all its elements are
positive. Nonetheless, we rule out this solution since it does not verify $d^T x = 0$,
and we consider the second largest eigenvalue $\lambda$. Rescaling by $y = D^{1/2} x$ leads to
the standard eigenvalue problem

$$D^{-1/2} A D^{-1/2} y = \lambda y,$$

or, equivalently,

$$L y = (1 - \lambda) y,$$

using the normalized Laplacian $L = I_n - D^{-1/2} A D^{-1/2}$. Hence, $y$ is an
eigenvector of the normalized Laplacian. The link with normalized spectral clustering is
completed by noticing that $\lambda$ should be the second largest eigenvalue, and hence
$1 - \lambda$ should be the second smallest eigenvalue of $L$.
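The relaxation above can be turned into a small two-cluster procedure. The following is a sketch under assumptions and not an optimised implementation: it uses plain power iteration on $(S + I)/2$, with $S = D^{-1/2} A D^{-1/2}$, after projecting out the leading eigenvector $D^{1/2} 1_n$; all degrees are assumed positive, and the function name `two_cluster_spectral` is hypothetical.

```python
import math

def two_cluster_spectral(A, iters=500):
    """Split nodes by the sign of x = D^{-1/2} y, y the second eigenvector of S."""
    n = len(A)
    d = [sum(row) for row in A]                          # degrees, assumed positive
    top = [math.sqrt(di) for di in d]                    # D^{1/2} 1_n, eigenvalue 1 of S
    norm_top = math.sqrt(sum(t * t for t in top))
    top = [t / norm_top for t in top]
    y = [math.sin(i + 1.0) for i in range(n)]            # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(y, top))        # enforce the constraint d^T x = 0
        y = [a - proj * b for a, b in zip(y, top)]
        Sy = [sum(A[i][j] * y[j] / math.sqrt(d[i] * d[j]) for j in range(n))
              for i in range(n)]
        y = [(s + a) / 2 for s, a in zip(Sy, y)]         # apply (S + I)/2, a nonnegative shift
        norm = math.sqrt(sum(a * a for a in y))
        y = [a / norm for a in y]
    x = [y[i] / math.sqrt(d[i]) for i in range(n)]       # undo the rescaling y = D^{1/2} x
    return [1 if xi >= 0 else -1 for xi in x]
```

The shift by the identity keeps all eigenvalues of the iterated matrix nonnegative, so the iteration converges to the second largest eigenvalue of $S$ rather than to the one of largest magnitude. For large graphs, a sparse eigensolver would replace this dense loop.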
4.4.3 Information-theoretic Results for Consistent Recovery in SBMs


This section presents information-theoretic results about recovery consistency in 
SBMs. 


Non-binary SBMs 


Let us first generalise SBMs to networks with non-binary interactions. We denote
by $S$ the space of interactions, and by $f_{in}$ and $f_{out}$ the interaction densities (with
respect to a measure $\mu$). These parameters specify a probability measure on a space
of observations

$$\mathcal{A} = \left\{ a = (a_{ij}) \in S^{n \times n} \text{ such that } a_{ij} = a_{ji}, \; a_{ii} = 0 \text{ for all } i, j \right\}$$

with probability density function

$$P(a \mid z) = \prod_{1 \le i < j \le n} f_{z_i z_j}(a_{ij}) \qquad (4.23)$$

with respect to the $n(n-1)/2$-fold product of the reference measure $\mu$, where $f_{z_i z_j} = f_{in}$ if $z_i = z_j$ and $f_{z_i z_j} = f_{out}$ otherwise.

In other words, for an observation $A$ distributed according to (4.23), the entries
$a_{ij}$, $1 \le i < j \le n$, are mutually independent, and $a_{ij}$ is distributed according to
$f_{in}$ when $z_i = z_j$, and according to $f_{out}$ otherwise. In particular, when $S = \{0, 1\}$
and $f_{in}$, $f_{out}$ are Bernoulli distributions, we recover the binary homogeneous SBM
as defined in Section 2.3.1. When $S = \mathbb{Z}_+$ and $f_{in}$, $f_{out}$ are Poisson distributions, we
recover the Poisson SBM (see equation (2.7)).

The node labelling $z$ representing the block membership structure is an
unknown parameter to be estimated. We consider the node labelling as a random
variable distributed according to the uniform distribution $\pi(z) = K^{-n}$ on the
parameter space $\mathcal{Z} = [K]^n$. In this case the joint distribution of the node
labelling and the observed data is characterised by a probability density

$$P(z, A) = \pi(z) P(A \mid z) \qquad (4.24)$$

on $\mathcal{Z} \times \mathcal{A}$ with respect to $\mathrm{card}_{\mathcal{Z}} \times \mu$, where $\mathrm{card}_{\mathcal{Z}}$ is the counting measure on $\mathcal{Z}$.


Regime of asymptotic recovery 


We recall that the Hamming distance between two sequences $y, z \in [K]^n$ is defined
as the number of positions at which the corresponding symbols are different, i.e.,

$$d_{Ham}(y, z) = \sum_{i=1}^{n} 1(y_i \neq z_i).$$

For an estimator $\hat{z}$ of node labelling $z \in [K]^n$, we define the absolute classification
error as follows:

$$d^*_{Ham}(\hat{z}, z) = \min_{\tau \in S_K} \sum_{i=1}^{n} 1\left( \tau(\hat{z}_i) \neq z_i \right). \qquad (4.25)$$

This corresponds to the number of misclassified nodes by the estimator $\hat{z}$ up to a
global permutation$^{5}$ $\tau \in S_K$.
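The absolute classification error can be computed by brute force over all relabellings. This is a minimal sketch assuming labels are coded as integers $0, \dots, K-1$ (instead of $1, \dots, K$); the function names are hypothetical. The cost grows like $K!$, so this is only meant for a small number of communities.

```python
from itertools import permutations

def hamming(y, z):
    """Number of positions at which the two label sequences differ."""
    return sum(yi != zi for yi, zi in zip(y, z))

def classification_error(z_hat, z, K):
    """Minimum Hamming distance to z over all permutations of the labels of z_hat."""
    return min(hamming([tau[label] for label in z_hat], z)
               for tau in permutations(range(K)))
```

For instance, `classification_error([1, 1, 0, 0], [0, 0, 1, 1], 2)` is zero, since the two labellings describe the same partition.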
When analysing the average performance of an estimator, we can view $\hat{z} : A \in \mathcal{A} \mapsto \hat{z}(A) \in [K]^n$ as a $[K]^n$-valued random variable defined on the set of
observations $\mathcal{A}$. Then, $\mathbb{E}_z d^*_{Ham}(\hat{z}, z)$ is equal to the expected clustering error given the true
node labelling $z$, and

$$\mathbb{E} d^*_{Ham}(\hat{z}) = \sum_{z \in [K]^n} \pi(z) \, \mathbb{E}_z d^*_{Ham}(\hat{z}, z)$$

is the average clustering error with respect to the node labelling distribution $\pi$ on
the parameter space.

We say that the estimator $\hat{z}$ asymptotically achieves exact recovery, or equivalently
that $\hat{z}$ is a strongly consistent estimator of $z$, if

$$\mathbb{E} d^*_{Ham}(\hat{z}) \to 0 \quad \text{as } n \to \infty. \qquad (4.26)$$

Condition (4.26) means that asymptotically every node is correctly classified. This
demand is often excessive. A more reasonable setting is when only a vanishing
fraction of nodes is misclassified (that is, at most $o(n)$ nodes are misclassified). We say
that the estimator $\hat{z}$ asymptotically achieves almost exact recovery (or that $\hat{z}$ is a consistent
estimator) if

$$n^{-1} \, \mathbb{E} d^*_{Ham}(\hat{z}) \to 0 \quad \text{as } n \to \infty.$$

Remark 4.4. Exact and almost exact recovery are the two most studied regimes
of cluster recovery. Another regime, called detection, corresponds to the existence
of an estimator performing better than a random guess. This condition is much
weaker and therefore can hold even if the graph is
very sparse (for example, when the average degree is constant). We will not discuss
this regime here, since the proof techniques are very different. We refer the reader
to Moore, 2017.


5. The presence of the global permutation in this definition is justified by the observation that permuting all
the node labels does not change the overall clustering.

Information-theoretic conditions for consistent recovery


Definition 4.3 (Rényi divergence). The Rényi divergence (of order 1/2) between two probability
distributions $f$ and $g$ is defined as

$$D_{1/2}(f, g) = -2 \log \int \left( \frac{df}{d\mu} \right)^{1/2} \left( \frac{dg}{d\mu} \right)^{1/2} d\mu,$$

where $\mu$ is an arbitrary measure which dominates $f$ and $g$. We use the following
conventions: $\log 0 = -\infty$, $0/0 = 0$, and $x/0 = \infty$ for $x > 0$.

Remark 4.5. The Rényi divergence is linked to the Hellinger distance, $\mathrm{Hel}(f, g)$,
defined by $\mathrm{Hel}^2(f, g) = \frac{1}{2} \int \left( \sqrt{\frac{df}{d\mu}} - \sqrt{\frac{dg}{d\mu}} \right)^2 d\mu$, via the formula $D_{1/2}(f, g) = -2 \log\left( 1 - \mathrm{Hel}^2(f, g) \right)$.

In what follows, we assume that a sigma-finite reference measure $\mu$ on $S$ is
fixed once and for all, and we write $\frac{df}{d\mu}$, $\frac{dg}{d\mu}$ simply as $f$, $g$, and we omit $d\mu$ from
the integral signs, so that $D_{1/2}(f, g) = -2 \log \int \sqrt{f g}$. When $S$ is countable, $\mu$
is always chosen as the counting measure, in which case we write $D_{1/2}(f, g) = -2 \log \sum_{x \in S} \sqrt{f(x) g(x)}$.
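For a countable interaction space, the divergence and its link with the Hellinger distance are straightforward to evaluate. The sketch below assumes that probability mass functions are given as Python dictionaries mapping each interaction value to its probability; the function names are hypothetical.

```python
import math

def renyi_half(f, g):
    """D_{1/2}(f, g) = -2 log sum_x sqrt(f(x) g(x)) for two pmfs given as dicts."""
    bc = sum(math.sqrt(p * g.get(x, 0.0)) for x, p in f.items())
    return math.inf if bc == 0 else -2.0 * math.log(bc)   # convention log 0 = -infinity

def hellinger_sq(f, g):
    """Squared Hellinger distance between two pmfs given as dicts."""
    support = set(f) | set(g)
    return 0.5 * sum((math.sqrt(f.get(x, 0.0)) - math.sqrt(g.get(x, 0.0))) ** 2
                     for x in support)
```

The relation of Remark 4.5 then reads `renyi_half(f, g) == -2 * math.log(1 - hellinger_sq(f, g))` up to floating-point error, and the divergence is infinite when the two distributions have disjoint supports.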


Theorem 4.6. Consider a homogeneous SBM with $n \gg 1$ nodes, $K \asymp 1$ blocks, and
interaction distributions $f_{in} = f^{(n)}$ and $f_{out} = g^{(n)}$ over $S = S^{(n)}$. Let $I = I_n$ be the
Rényi divergence between $f$ and $g$. The following holds:

(i) a consistent estimator exists if $I \gg n^{-1}$ and does not exist if $I \lesssim n^{-1}$;

(ii) a strongly consistent estimator exists if $I \ge (1 + \Omega(1)) \frac{K \log n}{n}$, and does not
exist if $I \le (1 - \Omega(1)) \frac{K \log n}{n}$.

Theorem 4.6 shows that the Rényi divergence governs the possibility or
impossibility of (strongly) consistent recovery in non-binary SBMs. In fact, when the
interaction distributions $f_{in}$ and $f_{out}$ are too similar (in the sense that their Rényi divergence
is smaller than $n^{-1}$), then there is not enough information provided by the network
to recover the communities consistently.

Theorem 4.6 is proved in Avrachenkov et al., 2022. The literature on consistency
thresholds in SBMs is large, and one can refer to Zhang et al., 2016 for binary SBMs
($S = \{0, 1\}$) and to Jog and Loh, 2015; Xu et al., 2020 for weighted ($S = \mathbb{R}$) or
edge-labelled ($S = \{0, 1, \dots, L\}$) SBMs.




Application to binary SBMs 


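Let us see how we can apply Theorem 4.6 to sparse binary SBMs, for which $f_{in} = \mathrm{Ber}(p_{in})$ and $f_{out} = \mathrm{Ber}(p_{out})$ with $p_{in}, p_{out} \ll 1$. A Taylor expansion gives

$$\begin{aligned} D_{1/2}(f_{in}, f_{out}) &= -2 \log\left( \sqrt{(1 - p_{in})(1 - p_{out})} + \sqrt{p_{in} p_{out}} \right) \\ &= -2 \log\left( 1 - \frac{p_{in} + p_{out}}{2} + \sqrt{p_{in} p_{out}} + O(p_{in} p_{out}) \right) \\ &= -2 \log\left( 1 - \frac{\left( \sqrt{p_{in}} - \sqrt{p_{out}} \right)^2}{2} + O(p_{in} p_{out}) \right) \\ &= \left( \sqrt{p_{in}} - \sqrt{p_{out}} \right)^2 + O(p_{in} p_{out}). \end{aligned} \qquad (4.27)$$

This can be applied to the following two particular cases.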


Example 4.1. Consider a regime where $p_{in} = a p_n$ and $p_{out} = b p_n$ for scale-independent
constants $a \neq b$ and with $p_n \ll 1$. Theorem 4.6 and equation (4.27) tell that a
consistent estimator exists if $n p_n \gg 1$, and does not exist if $n p_n \lesssim 1$. We note that
the key quantity $n p_n$ is of the same order as the expected degree $d_n$ (for instance, $d_n \approx \frac{a + b}{2} n p_n$ when $K = 2$).
Thus, for the possibility of consistent recovery we require that the expected degree
diverges with the size of the network.

Example 4.2. In a regime where $p_{in} = a \frac{\log n}{n}$ and $p_{out} = b \frac{\log n}{n}$ for
scale-independent constants $a, b$, Theorem 4.6 and equation (4.27) tell that a strongly
consistent estimator exists if $(\sqrt{a} - \sqrt{b})^2 > K$ and does not exist if $(\sqrt{a} - \sqrt{b})^2 < K$.
This is the well-known threshold for strong consistency in binary SBMs (Abbe
et al., 2015; Mossel et al., 2015).
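The approximation (4.27) and the resulting threshold can be checked numerically. The sketch below is an illustration with hypothetical function names; it evaluates the exact Rényi divergence between two Bernoulli distributions and compares it to $(\sqrt{a} - \sqrt{b})^2 \frac{\log n}{n}$ in the logarithmic-degree regime.

```python
import math

def renyi_half_bernoulli(p, q):
    """Exact D_{1/2} between Ber(p) and Ber(q)."""
    return -2.0 * math.log(math.sqrt((1 - p) * (1 - q)) + math.sqrt(p * q))

def strong_consistency_possible(a, b, K):
    """True above the threshold and False below it (the boundary case is not covered)."""
    return (math.sqrt(a) - math.sqrt(b)) ** 2 > K

# Example: a = 9, b = 1 gives (3 - 1)^2 = 4, so exact recovery is possible for K = 2.
n = 10 ** 6
a, b = 9.0, 1.0
p_in, p_out = a * math.log(n) / n, b * math.log(n) / n
ratio = renyi_half_bernoulli(p_in, p_out) * n / math.log(n)   # close to (sqrt(a) - sqrt(b))^2
```

The rescaled divergence `ratio` approaches $(\sqrt{a} - \sqrt{b})^2$ as $n$ grows, which is the quantity compared to $K$ in Example 4.2.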


Remark 4.6. Considering the setting of Example 4.2, we see that for $K = 2$ strong
consistency requires $\frac{a + b}{2} - \sqrt{ab} > 1$. Since $\frac{a + b}{2} > 1$ is the condition
for connectivity in this SBM (see Theorem 2.2), this means that exact recovery in
SBM is a strictly stronger requirement than connectivity.$^{6}$


Other Particular Cases of Non-binary SBMs 


Example 4.3 (Poisson interactions). The Rényi divergence between Poisson
distributions with means $\lambda$ and $\mu$ is exactly equal to $I = (\sqrt{\lambda} - \sqrt{\mu})^2$. In a regime
when $\lambda = a \frac{\log n}{n}$ and $\mu = b \frac{\log n}{n}$ with constants $a, b > 0$, Theorem 4.6 tells that
a strongly consistent estimator exists if $(\sqrt{a} - \sqrt{b})^2 > K$ and does not exist if
$(\sqrt{a} - \sqrt{b})^2 < K$. This condition is similar to the condition in Example 4.2, and
is due to the fact that Poisson distributions with small mean are well approximated
by Bernoulli distributions.

6. If the graph was not connected, then a.s. the graph would contain isolated nodes (see Lemma 2.4). Hence,
exact recovery would not be possible, as one could not correctly classify the isolated nodes better than a
random guess. Therefore exact recovery requires connectivity, but Example 4.2 shows that connectivity alone
is not enough.


Example 4.4 (Censored block model). Let us consider a latent binary SBM
with $f_{in} = \mathrm{Ber}(p_0)$ and $f_{out} = \mathrm{Ber}(q_0)$ for which each interaction and
non-interaction is revealed independently with probability $r = r_0 \frac{\log n}{n}$, where we
assume that $p_0$, $q_0$ and $r_0$ are constants. The resulting observed network is a
non-binary SBM with interaction space $S = \{\text{present}, \text{absent}, \text{censored}\}$ (where
censored denotes the unobserved interactions) and with intra-block and inter-block
probability distributions $\tilde{f}_{in}$ and $\tilde{f}_{out}$. We have $\tilde{f}_{in}(\text{present}) = r p_0$, $\tilde{f}_{in}(\text{absent}) = r(1 - p_0)$ and $\tilde{f}_{in}(\text{censored}) = 1 - r$, and similarly for $\tilde{f}_{out}$ with $q_0$ in place of $p_0$. From

$$D_{1/2}(\tilde{f}_{in}, \tilde{f}_{out}) = r \left( \left( \sqrt{p_0} - \sqrt{q_0} \right)^2 + \left( \sqrt{1 - p_0} - \sqrt{1 - q_0} \right)^2 \right) + O(r^2)$$

it follows that a strongly
consistent estimator exists if $r_0 > r_0^{crit}$ and does not exist if $r_0 < r_0^{crit}$, where

$$r_0^{crit} = \frac{K}{\left( \sqrt{p_0} - \sqrt{q_0} \right)^2 + \left( \sqrt{1 - p_0} - \sqrt{1 - q_0} \right)^2}.$$

For $K = 2$, this coincides with the critical
threshold obtained in Dhara et al., 2022.
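As a small numerical illustration of the censored block model, the following sketch evaluates the critical revelation rate and the exact divergence between the two observed three-point distributions. The function names are hypothetical, and the formulas are the ones stated in Example 4.4.

```python
import math

def censored_critical_rate(p0, q0, K):
    """Critical constant r_0 above which exact recovery is possible."""
    c = ((math.sqrt(p0) - math.sqrt(q0)) ** 2
         + (math.sqrt(1 - p0) - math.sqrt(1 - q0)) ** 2)
    return K / c

def censored_renyi_half(p0, q0, r):
    """Exact D_{1/2} between the observed intra-block and inter-block distributions."""
    bc = (r * math.sqrt(p0 * q0)                       # both 'present'
          + r * math.sqrt((1 - p0) * (1 - q0))         # both 'absent'
          + (1 - r))                                   # both 'censored'
    return -2.0 * math.log(bc)
```

For example, with $p_0 = 0.8$, $q_0 = 0.2$ and $K = 2$, the first-order coefficient of the divergence is $0.4$, so that `censored_critical_rate(0.8, 0.2, 2)` is equal to 5 up to rounding.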


4.4.4 Consistency of Spectral Methods in SBM 


In this section, we will prove that spectral clustering is consistent in the SBM. For 
simplicity, we will consider spectral clustering using the graph adjacency matrix, 
but a similar proof would hold if one were to use the normalized Laplacian. 


Heuristic: mean-field model 

We first consider the mean-field model of the SBM, that is the model where all the
random quantities are replaced by their expectations. In particular, the mean-field
graph becomes the weighted graph formed by the expected adjacency matrix of an
SBM graph. Therefore, if $(z, G)$ is drawn from $SBM(n, \pi, Q)$, then the adjacency
matrix of the corresponding mean-field graph is

$$\mathbb{E}A = Z Q Z^T,$$

where $Q \in [0, 1]^{K \times K}$ is the rate matrix (recall that element $Q_{k\ell}$ denotes the
probability of edge appearance between a node in community $k$ and a node in community
$\ell$), and $Z \in \{0, 1\}^{n \times K}$ is the membership matrix defined by

$$Z_{ik} = \begin{cases} 1, & \text{if } z_i = k, \\ 0, & \text{otherwise.} \end{cases}$$


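The following lemma specifies the eigenstructure of $\mathbb{E}A$.

Lemma 4.7. Assume $Q$ is full-rank, and let $U D U^T$ be an eigendecomposition of
$\mathbb{E}A$. Then $U = Z X$, where $X \in \mathbb{R}^{K \times K}$ and $\| X_{k*} - X_{\ell*} \| = \sqrt{n_k^{-1} + n_\ell^{-1}}$ for all
$1 \le k < \ell \le K$, and where $X_{k*}$ denotes the row $k$ of $X$.

Proof. Let $\Delta = \mathrm{diag}(\sqrt{n_1}, \dots, \sqrt{n_K})$. Then, we can write

$$\mathbb{E}A = Z Q Z^T = (Z \Delta^{-1}) (\Delta Q \Delta) (Z \Delta^{-1})^T.$$

The matrix $Z \Delta^{-1}$ has orthonormal columns. Indeed,

$$(Z \Delta^{-1})^T Z \Delta^{-1} = \Delta^{-1} Z^T Z \Delta^{-1} = I_K,$$

where we used the fact that $Z^T Z = \mathrm{diag}(n_1, \dots, n_K) = \Delta^2$.
Let $R D R^T$ be the eigendecomposition of $\Delta Q \Delta$. Thus,

$$\mathbb{E}A = (Z \Delta^{-1} R) \, D \, (Z \Delta^{-1} R)^T$$

is the eigendecomposition of $\mathbb{E}A$. We finish the proof by letting $U = Z \Delta^{-1} R$ and
$X = \Delta^{-1} R$. We then have

$$X X^T = \mathrm{diag}\left( n_1^{-1}, \dots, n_K^{-1} \right).$$

Hence,

$$\| X_{k*} - X_{\ell*} \|^2 = \| X_{k*} \|^2 + \| X_{\ell*} \|^2 - 2 X_{k*} X_{\ell*}^T = n_k^{-1} + n_\ell^{-1} + 0,$$

and the claim follows.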


In particular, Lemma 4.7 ensures that the community information is encoded
in the eigenstructure of $\mathbb{E}A$. Indeed, the $K$ eigenvectors associated with non-zero
eigenvalues of $\mathbb{E}A$ are given by the columns of $U$, which can be written as $ZX$. The
k-means step (see equation (4.10)) then aims to recover $Z$ (and $X$) from $U$.
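The row-separation property of Lemma 4.7 can be checked numerically for two communities. The sketch below is only an illustration: it takes an arbitrary $2 \times 2$ rotation in place of the orthogonal matrix $R$ (an assumption, since the identity only uses the orthogonality of $R$), forms $X = \Delta^{-1} R$, and returns the distance between its two rows; the function name `row_distance` is hypothetical.

```python
import math

def row_distance(n1, n2, angle):
    """Distance between the two rows of X = Delta^{-1} R for community sizes n1, n2."""
    R = [[math.cos(angle), -math.sin(angle)],
         [math.sin(angle), math.cos(angle)]]          # any 2 x 2 orthogonal matrix
    sizes = [n1, n2]
    X = [[R[k][c] / math.sqrt(sizes[k]) for c in range(2)] for k in range(2)]
    return math.dist(X[0], X[1])
```

Whatever the rotation angle, the returned value equals $\sqrt{n_1^{-1} + n_2^{-1}}$, so the rows of $U = ZX$ take two distinct values separated by a gap that depends only on the community sizes.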


Consistency of spectral clustering in SBM 


We established that if one were to observe the mean-field graph, then recovery of
communities would be possible by looking at the $K$ leading eigenvectors of the
mean-field adjacency matrix $\mathbb{E}A$. The following theorem states that, under some
natural conditions, consistent recovery is possible by looking at the $K$ leading
eigenvectors of the random graph adjacency matrix $A$. We recall that the absolute
classification error $d^*_{Ham}(\hat{z}, z)$ is defined in (4.25), and an estimator is consistent if
$\frac{d^*_{Ham}(\hat{z}, z)}{n} = o(1)$.

Theorem 4.8. Let $(z, G) \sim SBM(n, \pi, P)$, where $P$ is of rank $K$ and the smallest
absolute nonzero eigenvalue of $\mathbb{E}A$ is larger than $\gamma_n$. Let $d_n$ be the expected degree and $\hat{z} \in [K]^n$
be the output of spectral clustering applied to the adjacency matrix. Then, there exists a
constant $c > 0$ such that if $(2 + \epsilon) \frac{K d_n}{\gamma_n^2} < c$, then with high probability

$$\frac{d^*_{Ham}(\hat{z}, z)}{n} \le c^{-1} (2 + \epsilon) \frac{K d_n}{\gamma_n^2}.$$


Example 4.5. Consider a homogeneous SBM with $P_{k\ell} = p_{in}$ if $k = \ell$ and $P_{k\ell} = p_{out}$ otherwise. Then, $d_n = \frac{n}{K} (p_{in} + (K - 1) p_{out})$ while $\gamma_n = \frac{n}{K} (p_{in} - p_{out})$.
Assume that $p_{in} = c_{in} p_n$, $p_{out} = c_{out} p_n$ where $c_{in}$, $c_{out}$ do not depend on $n$, and
suppose that the assumptions of Theorem 4.8 hold. Then, the error of spectral
clustering is bounded by

$$\frac{d^*_{Ham}(\hat{z}, z)}{n} \le c^{-1} (2 + \epsilon) K \left( \frac{c_{in} + (K - 1) c_{out}}{c_{in} - c_{out}} \right)^2 \frac{1}{d_n}.$$

This upper bound goes to zero when the average degree $d_n$ goes to infinity, ensuring
consistency of spectral methods in such a setting.


The intuition for the proof of Theorem 4.8 is as follows.

• Show that the $K$ leading eigenvectors of the adjacency matrix $A$ are not too
different from the $K$ leading eigenvectors of the expected adjacency matrix
$\mathbb{E}A$. This is done in two steps.

  – First use a result from random matrix theory to show that $A$ is concentrated
  around $\mathbb{E}A$. This is Theorem 4.9.
  – Then use this concentration to show that the eigenvectors are also
  concentrated. This is usually done using the Davis-Kahan theorem. We present a
  variation of it in Lemma 4.10.

• Conclude by bounding the error made by the k-means step.


Theorem 4.9 (Theorem 1.2 of Le et al., 2017). Let $A$ be the adjacency matrix of a
Bernoulli random graph $G(n, (p_{ij}))$, and let $d_n = n \max_{ij} p_{ij}$. For $\tau \sim d_n$ define $A_\tau = A + \frac{\tau}{n} 1_n 1_n^T$ the regularized adjacency matrix. Then, we have with high probability
when $n$ goes to infinity

$$\| A_\tau - \mathbb{E}A_\tau \|_2 = O\left( \sqrt{d_n} \right).$$

The proof of Theorem 4.9 is complex and beyond the scope of this book. We will
simply note that the regularization term $\frac{\tau}{n} 1_n 1_n^T$ is needed to ensure concentration
of the adjacency matrix when $d_n$ is small. Indeed, consider an Erdős-Rényi graph by
letting $p_{ij} = p$. If $d_n \ll \log n$, then the degrees of some nodes are much larger than
the expected degree $d_n = np$. This implies that some rows of the adjacency matrix
will have $\ell^2$ norms much larger than $\sqrt{d_n}$, which in turn implies $\| A - \mathbb{E}A \| \gg \sqrt{d_n}$.


Lemma 4.10 (Principal subspace perturbation). Let $M \in \mathbb{R}^{n \times n}$ be a symmetric
matrix of rank $K$ with smallest nonzero singular value $\gamma$, and let $\hat{M}$ be any symmetric matrix.
Denote by $U$ and $\hat{U} \in \mathbb{R}^{n \times K}$ the matrices whose columns are composed of the $K$
leading eigenvectors of $M$ and $\hat{M}$. Then, there exists a $K \times K$ orthogonal matrix $Q$
such that

$$\| U Q - \hat{U} \|_F \le \frac{2 \sqrt{2K}}{\gamma} \| M - \hat{M} \|_2.$$

Lemma 4.10 is a version of the Davis-Kahan 'sin θ' theorem that bounds the distance
between two subspaces spanned by the leading eigenvectors of two matrices. We
refer to Theorem 2 of Yu et al., 2015 for further explanations.

The last ingredient needed for the proof of Theorem 4.8 is a bound on the error 
made by the k-means step. The next lemma gives such a bound. 


Lemma 4.11 (Approximate k-means error bound, adapted from Lemma 5.3 of Lei
and Rinaldo, 2015). For $\epsilon > 0$ and any matrices $V, \hat{V} \in \mathbb{R}^{n \times K}$ such that $V = ZX$
with $Z \in \mathcal{Z}_{n,K}$ and $X \in \mathbb{R}^{K \times K}$, let $(\hat{Z}, \hat{X})$ be a $(1 + \epsilon)$ approximation of the
k-means problem (4.10) with input $\hat{V}$. We denote by $z$ and $\hat{z}$ the membership vectors associated to
the membership matrices $Z$ and $\hat{Z}$. Let $n_{min}$ be the size of the smallest community, and
$\delta = \min_{k, \ell : k \neq \ell} \| X_{k*} - X_{\ell*} \|$. If $8 (2 + \epsilon) \frac{\| V - \hat{V} \|_F^2}{\delta^2} < n_{min}$ then

$$d^*_{Ham}(\hat{z}, z) \le 8 (2 + \epsilon) \frac{\| V - \hat{V} \|_F^2}{\delta^2}.$$

Lemma 4.11 upper bounds the error made by the k-means step. The bound
involves the Frobenius distance between $V$ (the matrix with the eigenvectors of
$\mathbb{E}A$, which we know from the mean-field study, Lemma 4.7, can be written as $ZX$
and from which we can recover the community structure $Z$), and the matrix $\hat{V}$ (the
matrix with the eigenvectors of $A$).


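Proof. Denote $\bar{V} = \hat{Z} \hat{X}$. Intuitively, we want to show that if $\hat{V}$ is close to $V$, then
$\bar{V}$ is close to $V$ as well, where $\bar{V}$ is the solution of the minimisation problem (4.10)
with input $\hat{V}$. Let $C_k := \{ i : z_i = k \}$ be the set of nodes belonging
to community $k$, and $B_k := \{ i \in C_k : \| V_{i*} - (\hat{Z}\hat{X})_{i*} \|_2 \ge \delta/2 \}$. The sets $B_k$
correspond to nodes for which the k-means solution $\bar{V} = \hat{Z}\hat{X}$ is far away from
the matrix $V$. Let us first show that the sets $B_k$ are of small sizes. We have

$$\| V - \hat{Z}\hat{X} \|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{K} \left( V_{ij} - (\hat{Z}\hat{X})_{ij} \right)^2 = \sum_{i=1}^{n} \| V_{i*} - (\hat{Z}\hat{X})_{i*} \|_2^2 = \sum_{k=1}^{K} \sum_{i \in C_k} \| V_{i*} - (\hat{Z}\hat{X})_{i*} \|_2^2,$$

and thus,

$$\| V - \hat{Z}\hat{X} \|_F^2 \ge \sum_{k=1}^{K} \sum_{i \in B_k} \| V_{i*} - (\hat{Z}\hat{X})_{i*} \|_2^2 \ge \frac{\delta^2}{4} \sum_{k=1}^{K} |B_k|.$$

Hence,

$$\begin{aligned} \sum_{k=1}^{K} |B_k| &\le \frac{4}{\delta^2} \| V - \hat{Z}\hat{X} \|_F^2 \\ &\le \frac{4}{\delta^2} \left( \| V - \hat{V} \|_F + \| \hat{V} - \hat{Z}\hat{X} \|_F \right)^2 \\ &\le \frac{4}{\delta^2} \left( 1 + \sqrt{1 + \epsilon} \right)^2 \| V - \hat{V} \|_F^2 \\ &\le \frac{8 (2 + \epsilon)}{\delta^2} \| V - \hat{V} \|_F^2, \end{aligned} \qquad (4.28)$$

since $\| \hat{Z}\hat{X} - \hat{V} \|_F^2 \le (1 + \epsilon) \| Z'X' - \hat{V} \|_F^2$ for all $Z' \in \mathcal{Z}_{n,K}$, $X' \in \mathbb{R}^{K \times K}$ (in particular for $Z' = Z$ and $X' = X$), and since $(1 + \sqrt{1 + \epsilon})^2 \le 2 (2 + \epsilon)$.
Using the assumption of the lemma, we have $\sum_{k=1}^{K} |B_k| < n_{min}$. Hence, for
every $k \in [K]$, the sets $C_k \setminus B_k$ are non-empty. We will now claim that:

(i) If $i \in C_k \setminus B_k$ and $j \in C_\ell \setminus B_\ell$ with $k \neq \ell$, then $\bar{V}_{i*} \neq \bar{V}_{j*}$.
(ii) For $i, j \in C_k \setminus B_k$, we have $\bar{V}_{i*} = \bar{V}_{j*}$.

Therefore, every node $i \notin \cup_{k=1}^{K} B_k$ can be assigned to a class $\hat{z}_i$ based on the value
of the row $i$ of $\bar{V}$. Let $\sigma^* \in S_K$ be a permutation satisfying

$$\sigma^* \in \arg\min_{\sigma \in S_K} \sum_{i \notin \cup_{k=1}^{K} B_k} 1\left( \sigma(\hat{z}_i) \neq z_i \right).$$

For such $\sigma^*$, we have $\sum_{i=1}^{n} 1(\sigma^*(\hat{z}_i) \neq z_i) \le \sum_{k=1}^{K} |B_k|$. Thus,

$$d^*_{Ham}(\hat{z}, z) \le \sum_{i=1}^{n} 1\left( \sigma^*(\hat{z}_i) \neq z_i \right) \le \sum_{k=1}^{K} |B_k|,$$

and we conclude that the lemma's claim is true using equation (4.28).

Let us now show claim (i). If we were to assume that $\bar{V}_{i*} = \bar{V}_{j*}$, then this would
imply $\delta \le \| V_{i*} - V_{j*} \|_2 \le \| V_{i*} - \bar{V}_{i*} \|_2 + \| \bar{V}_{j*} - V_{j*} \|_2 < \delta/2 + \delta/2$, which
is a contradiction.

Finally, let us show claim (ii). Since $\hat{Z} \in \mathcal{Z}_{n,K}$ and $\hat{X} \in \mathbb{R}^{K \times K}$, then $\bar{V}$ has at
most $K$ distinct rows. We also know from claim (i) that $\bar{V}$ has at least $K$ distinct
rows. Hence, $\bar{V}$ has exactly $K$ distinct rows, and this in turn implies that $\bar{V}_{i*} = \bar{V}_{j*}$
for $i, j \in C_k \setminus B_k$.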


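Proof of Theorem 4.8. Let $\hat{V}$ (resp. $V$) be an n-by-K matrix whose columns are
composed of the $K$ leading eigenvectors of $A_\tau$ (resp. of $\mathbb{E}A_\tau$). Combining Lemma 4.10
and Theorem 4.9, we have for some orthogonal matrix $Q \in \mathbb{R}^{K \times K}$ and some constant $C > 0$

$$\| V Q - \hat{V} \|_F \le \frac{2 \sqrt{2K}}{\gamma_n} \| A_\tau - \mathbb{E}A_\tau \| \le \frac{2 \sqrt{2K}}{\gamma_n} C \sqrt{d_n}, \qquad (4.29)$$

with high probability.

We will now directly apply Lemma 4.11 to $\hat{V}$ and $VQ$. Lemma 4.7 shows that
$VQ = ZXQ = ZX'$ with $X' = XQ$, where $\| X'_{k*} - X'_{\ell*} \| = \sqrt{\frac{1}{n_k} + \frac{1}{n_\ell}}$. Therefore,
we can choose $\delta = 1/\sqrt{n_{max}}$. Using equation (4.29), a sufficient condition for
$8 (2 + \epsilon) \frac{\| \hat{V} - VQ \|_F^2}{\delta^2} < n_{min}$ to hold is

$$8 (2 + \epsilon) \, 8 C^2 K \frac{d_n}{\gamma_n^2} < \frac{n_{min}}{n_{max}}.$$

Therefore, we can apply Lemma 4.11, which states that

$$\frac{d^*_{Ham}(\hat{z}, z)}{n} \le 8 (2 + \epsilon) \frac{\| \hat{V} - VQ \|_F^2}{\delta^2 n} \le 8 (2 + \epsilon) \, 8 C^2 K \frac{d_n}{\gamma_n^2} \frac{n_{max}}{n},$$

and the statement of the theorem holds, since $\frac{n_{max}}{n} \le 1$.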


Further Notes 


Spectral clustering is well explained in the review by Von Luxburg, 2007. For more 
details on Louvain algorithm, we refer the reader to Blondel eż al., 2008; Good 
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et al., 2010. A nice application of Louvain algorithm (and more generally com- 
munity detection methods) to content recommendations on Reddit is presented in 
Jamonnak eż al., 2015. Additional deficiencies of Louvain algorithm (such as badly 
connected communities) were discovered by Traag et al., 2019, who also proposed a 
refinement of Louvain algorithm (called Leiden algorithm). We also mention the res- 
olution limit problem (Fortunato and Barthelemy, 2007), common to modularity 
maximisation methods. We finally note that while modularity methods are popular, 
thanks to the existence of fast algorithms and to the heuristic consideration link- 
ing modularity maximisation with maximum likelihood approach, care is needed 
since modularity maximisation is not strictly equivalent to likelihood maximisation 
(Zhang and Peixoto, 2020) and modularity algorithms are prone to over-fitting. 

Consistency of spectral methods in SBM were studied by Lei and Rinaldo, 2015, 
and further developed in Abbe eż al., 2020. For a recent overview of various appli- 
cations of spectral methods, we refer to Chen et al., 2021. Spectral methods are 
not the only ones to be consistent on SBM. For example, the consistency of SDP 
methods has been demonstrated (Hajek et al., 2016a,b; Guédon and Vershynin, 
2016; Amini et al., 2018; Fei and Chen, 2019). Moreover, community detection 
in other block models such as the Geometric Block Model has been recently stud- 
ied (Galhotra et al., 2018; Sankararaman and Baccelli, 2018; Avrachenkov et al., 
2021a). 

Many other community detection methods exist, for example: belief propaga- 
tion (Moore, 2017; Decelle et al., 2011), game-theoretic methods (Avrachenkov 
et al., 2018a; Moscato et al., 2019), methods based on the map equation (Rosvall 
and Bergstrom, 2008; Rosvall et al., 2009) and spectral methods based on other 
matrices such as the non-backtracking matrix (Krzakala et al., 2013) or the Bethe- 
Hessian (Saade et al., 2014). We also refer to the review by Fortunato, 2010 for 
more insights about the community detection problem. 

Finally, an important question not covered here is the estimation of the number 
of communities. For this topic, we refer the reader to Le and Levina, 2015; Bickel 
and Sarkar, 2016; Lei, 2016; Saldana et al., 2017; Hu et al., 2020.


Chapter 5


Graph-based Semi-supervised Learning 


Semi-supervised learning (SSL) aims at achieving superior learning performance by 
combining unlabelled and labelled data. Since typically the amount of unlabelled 
data is large compared to the amount of labelled data, SSL methods are relevant 
when the performance of unsupervised learning is low, or when the cost of getting 
a large amount of labelled data for supervised learning is too high. Unfortunately,
many standard semi-supervised learning techniques have been shown to use the
unlabelled data inefficiently, leading to unsatisfactory or unstable performance
(Chapelle et al., 2006, Chapter 4; Ben-David et al., 2008; Cozman et al., 2002).
Moreover, the presence of noise in the labelled data may further degrade their per- 
formance. In practice, the noise often comes from a tired or non-diligent expert 
carrying out the labelling task. 

In this chapter, we will review some standard methods for semi-supervised graph 
clustering. In particular, we will study the performance of those methods in the case 
when the amount of labelled data is low and we will propose robust solutions in 
the presence of noisy labels.

General idea. We assume that the node set $V = [n]$ of a graph $G = (V, E)$ is
partitioned into $K$ non-overlapping communities, represented by the latent
community labelling vector $z \in [K]^n$. It will be convenient to have a one-hot
representation of $z$, by defining an $n \times K$ ground-truth membership matrix
$Z \in \{0, 1\}^{n \times K}$ such that

$$Z_{ik} = \begin{cases} 1, & \text{if } z_i = k, \\ 0, & \text{otherwise.} \end{cases}$$


As seen in Chapter 4, unsupervised community detection is the problem of
recovering $Z$ from the observation of $G$ (and sometimes with the knowledge of
$K$). We study here the noisy semi-supervised setting. More precisely, we assume
that, in addition to the observation of the graph, an oracle gives us extra
information about the cluster assignment of some nodes. We call those nodes the
labelled nodes, and we denote by $\ell$ the set of labelled nodes. Among those
nodes, some are correctly labelled by the oracle, while some are mislabelled. We
denote by $\ell_0$ the set of mislabelled nodes and by $\ell_1$ the set of correctly
labelled nodes. In particular, $\ell = \ell_0 \cup \ell_1$. The oracle can be
represented by a matrix $S$ of size $n \times K$, whose rows $S_{i\cdot}$ are given by

$$S_{i\cdot} = \begin{cases} Z_{i\cdot}, & \text{if } i \in \ell_1, \\ \tilde{Z}_{i\cdot}, & \text{if } i \in \ell_0, \\ 0_{1 \times K}, & \text{if } i \notin \ell, \end{cases} \qquad (5.1)$$

where $\tilde{Z}_{i\cdot}$ is chosen in $\{z \in \{0,1\}^{K} : \|z\|_1 = 1 \text{ and } z \neq Z_{i\cdot}\}$, and $0_{1 \times K}$ denotes
the row of $K$ zeros.
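As an illustration of definition (5.1), the oracle matrix can be built explicitly. The sketch below is only an example and not part of the original text: the helper name `make_oracle`, the 0-based class indices and the uniform choice of the wrong label among the $K - 1$ incorrect classes are assumptions made for the illustration.

```python
import numpy as np

def make_oracle(z, K, labelled, mislabelled, rng):
    """Build the n x K oracle matrix S of equation (5.1).

    z           : true labels in {0, ..., K-1} (0-based, unlike the text)
    labelled    : indices of the labelled nodes
    mislabelled : subset of `labelled` that receives a wrong label
    """
    n = len(z)
    S = np.zeros((n, K), dtype=int)
    wrong = set(mislabelled)
    for i in labelled:
        if i in wrong:
            # assumption: the wrong class is drawn uniformly at random
            choices = [k for k in range(K) if k != z[i]]
            S[i, rng.choice(choices)] = 1
        else:
            S[i, z[i]] = 1
    return S  # rows of unlabelled nodes stay equal to zero

rng = np.random.default_rng(0)
z = np.array([0, 0, 1, 1, 2, 2])
S = make_oracle(z, K=3, labelled=[0, 2, 4], mislabelled=[4], rng=rng)
```

Here nodes 0 and 2 are correctly labelled, node 4 is mislabelled, and the oracle is informative since two labels out of three are correct.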

In other words, the oracle (5.1) reveals the correct cluster assignment of $|\ell_1|$
nodes, and a false cluster assignment for $|\ell_0|$ nodes. It reveals nothing for the
remaining $n - |\ell|$ nodes. The quantity $|\ell_0| / |\ell|$ is the rate of mistakes of the
oracle (i.e., the probability that the oracle reveals false information given that it
reveals something). The oracle is informative if this quantity is less than $1/2$,
which is equivalent to the intuitive condition $|\ell_1| > |\ell_0|$. In the following, we
will always assume that the oracle is informative.


Assumption 5.1. The oracle is informative, that is $|\ell_1| > |\ell_0|$.

Given the oracle $S$ and the graph $G$, our strategy is to find a matrix
$\hat{X} \in \mathbb{R}^{n \times K}$ from which we could predict the nodes' labels. We will refer to
the rows $\hat{X}_{i\cdot}$ as classification functions, and a node $i$ will be classified in
cluster $\hat{z}_i$ if

$$\hat{z}_i = \arg\max_{k \in [K]} \hat{X}_{ik}. \qquad (5.2)$$

A standard framework is to define $\hat{X}$ as the solution of an optimisation problem
of the type

$$\hat{X} = \arg\min_{X \in \mathcal{X}} C(X, S),$$

where $C(X, S)$ is a cost function, and $\mathcal{X}$ is a subset of $\mathbb{R}^{n \times K}$.
Notations. Throughout this chapter, $\ell$ denotes the set of nodes labelled by the
oracle, while $u = [n] \setminus \ell$ denotes the set of unlabelled nodes. The oracle is
represented by a matrix $S \in \{0, 1\}^{n \times K}$, defined as in equation (5.1), and the goal
is to infer $Z \in \{0, 1\}^{n \times K}$ after the observation of the graph $G$ and the oracle $S$.

We can assume, up to a reordering of the nodes, that the first $|\ell|$ nodes are
labelled by the oracle, while the remaining $|u|$ ones are not. Accordingly, any matrix
$M \in \mathbb{R}^{n \times n}$ can be displayed in the block form

$$M = \begin{pmatrix} M_{\ell\ell} & M_{\ell u} \\ M_{u\ell} & M_{uu} \end{pmatrix}.$$

Moreover, for any matrix $X = (X_{ik}) \in \mathbb{R}^{n \times K}$, $X_{i\cdot}$ stands for the row $i$ of $X$, while
$X_{\cdot k}$ stands for the column $k$ of $X$, and we write $X = \begin{pmatrix} X_{\ell\cdot} \\ X_{u\cdot} \end{pmatrix}$.
Finally, $I_\ell$ denotes the $n \times n$ diagonal matrix whose element $(i, i)$ equals $1$ if
$i \in \ell$ and $0$ otherwise.


5.1 Laplacian-based SSL Methods 


5.1.1 Label Propagation


Presentation of the method. Spectral methods for unsupervised community
detection are based on the minimisation of quadratic functions, such as $\mathrm{Tr}(X^T L X)$
or $\mathrm{Tr}(X^T \mathcal{L} X)$ (see Section 4.1). Label Propagation extends this to the
semi-supervised setting. In particular, the foundational papers (Zhu and Ghahramani;
Zhu et al., 2003) considered the following optimisation problem

$$\hat{X}^{LP} = \arg\min_{\substack{X \in \mathbb{R}^{n \times K} \\ X_{\ell\cdot} = S_{\ell\cdot}}} \mathrm{Tr}\left(X^T L X\right). \qquad (5.3)$$

The constraint $X_{\ell\cdot} = S_{\ell\cdot}$ forces the solution $\hat{X}^{LP}$ to be equal to the oracle
prediction on the labelled nodes. We first note that this hard constraint may not be
suitable if the oracle is noisy, as it pushes the solution on the wrongly labelled
nodes towards a wrong classification. Moreover, the constraints $X^T X = I_K$ or
$X^T D X = I_K$ (see again Section 4.1) of spectral clustering, which prevent obtaining
flat solutions in unsupervised spectral methods, are absent here. Hence, the
optimisation problem (5.3) relies only on the hard constraint $X_{\ell\cdot} = S_{\ell\cdot}$ to prevent
degenerate solutions. We will see later that this is a problem when the amount of
labelled data is small. On the positive side, the following lemma provides a
closed-form expression for $\hat{X}^{LP}$.


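Lemma 5.2. The solution $\hat{X}^{LP}$ of the optimisation problem (5.3) is given by

$$\begin{aligned} \hat{X}^{LP}_{\ell\cdot} &= S_{\ell\cdot}, \\ \hat{X}^{LP}_{u\cdot} &= \left(I_{uu} - (D^{-1}A)_{uu}\right)^{-1} (D^{-1}A)_{u\ell}\, S_{\ell\cdot}, \end{aligned} \qquad (5.4)$$

where $I_{uu}$ is the identity matrix of size $|u| \times |u|$.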


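Proof. The constraint $X_{\ell\cdot} = S_{\ell\cdot}$ can be rewritten as follows:

$$\begin{aligned} X_{\ell\cdot} = S_{\ell\cdot} &\iff \forall k \in [K],\ \forall i \in \ell : \ (X_{ik} - S_{ik})^2 = 0 \\ &\iff \sum_{k=1}^{K} \sum_{i=1}^{n} (I_\ell)_{ii} (X_{ik} - S_{ik})^2 = 0 \\ &\iff \mathrm{Tr}\left((X - S)^T I_\ell (X - S)\right) = 0. \end{aligned}$$

Thus, a Lagrangian associated to the minimisation problem (5.3) is

$$H = \mathrm{Tr}\left(X^T L X + \mu (X - S)^T I_\ell (X - S)\right),$$

where $\mu$ is a Lagrange multiplier. For every $k \in [K]$, the derivative with respect to
$X_{\cdot k}$ gives

$$\frac{\partial H}{\partial X_{\cdot k}} = 2 \left(L X_{\cdot k} + \mu I_\ell (X_{\cdot k} - S_{\cdot k})\right).$$

Equating this derivative to zero leads to

$$(L + \mu I_\ell)\, \hat{X}^{LP} = \mu S,$$

where we used $I_\ell S = S$, while the derivative with respect to $\mu$ leads to the
constraint $I_\ell \hat{X}^{LP} = S$. Using block notation, we can write

$$L X = \begin{pmatrix} L_{\ell\ell} & L_{\ell u} \\ L_{u\ell} & L_{uu} \end{pmatrix} \begin{pmatrix} X_{\ell\cdot} \\ X_{u\cdot} \end{pmatrix} = \begin{pmatrix} L_{\ell\ell} X_{\ell\cdot} + L_{\ell u} X_{u\cdot} \\ L_{u\ell} X_{\ell\cdot} + L_{uu} X_{u\cdot} \end{pmatrix},$$

and therefore

$$\begin{aligned} L_{\ell\ell} \hat{X}^{LP}_{\ell\cdot} + L_{\ell u} \hat{X}^{LP}_{u\cdot} + \mu \hat{X}^{LP}_{\ell\cdot} &= \mu S_{\ell\cdot}, \\ L_{u\ell} \hat{X}^{LP}_{\ell\cdot} + L_{uu} \hat{X}^{LP}_{u\cdot} &= 0. \end{aligned}$$

The constraint $\hat{X}^{LP}_{\ell\cdot} = S_{\ell\cdot}$ leads to the solution

$$\begin{aligned} \hat{X}^{LP}_{\ell\cdot} &= S_{\ell\cdot}, \\ \hat{X}^{LP}_{u\cdot} &= -(L_{uu})^{-1} L_{u\ell}\, \hat{X}^{LP}_{\ell\cdot}. \end{aligned}$$

We end the proof by noticing that since $L = D - A$ and $D$ is a diagonal matrix, we
have $L_{u\ell} = -A_{u\ell}$ and $(L_{uu})^{-1} = (D_{uu} - A_{uu})^{-1} = \left(I_{uu} - (D^{-1}A)_{uu}\right)^{-1} (D_{uu})^{-1}$.
Finally, $(D_{uu})^{-1} A_{u\ell} = (D^{-1}A)_{u\ell}$ since $D$ is diagonal, and this ends the proof.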


We note from the proof of Lemma 5.2 that 


IX, = {* ifz € €, (5.5) 


0, otherwise. 


Finally, we summarise the method in Algorithm 8. The computation of $\hat{X}^{LP}$ from
equation (5.4) requires solving a $|u|$-by-$|u|$ linear system, which has in general a
time-complexity of $O(|u|^3)$ (less if the network is sparse). The ensuing paragraph
presents a method to compute $\hat{X}^{LP}$ in a decentralized and iterative manner.


Algorithm 8: Label Propagation (Zhu and Ghahramani; Zhu et al., 2003).
Input: graph $G$, oracle $S$.
Output: node labelling $\hat{z} = (\hat{z}_1, \dots, \hat{z}_n) \in [K]^n$.
Process:
• let $\hat{X}$ be as in equation (5.4);
• for $i = 1, \dots, n$ let $\hat{z}_i$ be defined by classification rule (5.2).

Return: $\hat{z}$.
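The closed form (5.4) translates into a few lines of linear algebra. The following is a minimal NumPy sketch, assuming a dense symmetric adjacency matrix without isolated nodes and that every unlabelled node can reach a labelled node; the function name `label_propagation` and the toy path graph are illustrative choices, not taken from the source.

```python
import numpy as np

def label_propagation(A, S):
    """Closed-form Label Propagation, following equation (5.4)."""
    labelled = S.sum(axis=1) > 0            # rows of S that carry a label
    u = ~labelled                           # unlabelled nodes
    P = A / A.sum(axis=1, keepdims=True)    # D^{-1} A
    X = S.astype(float)
    P_uu = P[np.ix_(u, u)]
    P_ul = P[np.ix_(u, labelled)]
    # solve (I_uu - P_uu) X_u = P_ul S_l instead of forming the inverse
    X[u] = np.linalg.solve(np.eye(u.sum()) - P_uu, P_ul @ S[labelled])
    return X, X.argmax(axis=1)              # classification rule (5.2), 0-based

# path graph 0 - 1 - 2 - 3; node 0 labelled as class 0, node 3 as class 1
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1, 0], [0, 0], [0, 0], [0, 1]])
X, z_hat = label_propagation(A, S)
# X[1] is (2/3, 1/3) and X[2] is (1/3, 2/3): each unlabelled node is
# attracted towards its closest labelled node.
```

Solving the linear system rather than inverting the matrix is a deliberate choice: it is cheaper and numerically more stable, and a sparse solver could be substituted for large graphs.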


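Interpretation as a propagation of the oracle labels. We start by assigning
to each node $i$ a 1-by-$K$ vector $X^{(0)}_{i\cdot} \in \mathbb{R}^{1 \times K}$, which is equal to the oracle
prediction $S_{i\cdot}$ for node $i$. Then, at each time step $t$, the update of $X$ is done as follows:

• if $i \in \ell$, then $X^{(t+1)}_{i\cdot} = X^{(t)}_{i\cdot}$ (no update is done);
• if $i \notin \ell$, then $X^{(t+1)}_{i\cdot}$ is taken to be the average of $X^{(t)}_{j\cdot}$ over the neighbours $j$
  of node $i$, that is, $X^{(t+1)}_{i\cdot} = \frac{1}{d_i} \sum_{j=1}^{n} A_{ij} X^{(t)}_{j\cdot}$.

This can be interpreted as a propagation of the oracle's information through the
graph or as a consensus algorithm with the states of some agents fixed. The labelled
nodes' value remains equal to the oracle information, while an unlabelled node will
sample the value of its neighbours and perform local averaging. In matrix form, we
can write this procedure as follows:

$$X^{(t+1)}_{i\cdot} = \begin{cases} X^{(0)}_{i\cdot}, & \text{if } i \in \ell, \\ \left(D^{-1} A X^{(t)}\right)_{i\cdot}, & \text{otherwise.} \end{cases}$$

The use of block notation leads to

$$\begin{aligned} X^{(t+1)}_{u\cdot} &= (D^{-1}A)_{uu} X^{(t)}_{u\cdot} + (D^{-1}A)_{u\ell} X^{(t)}_{\ell\cdot}, \\ X^{(t+1)}_{\ell\cdot} &= X^{(t)}_{\ell\cdot}, \end{aligned}$$

with the initial condition $X^{(0)} = S$. Since the matrix $(D^{-1}A)_{uu}$ is substochastic,
$X^{(t)}$ converges to $X^{(\infty)}$ satisfying the following system of equations

$$\begin{aligned} X^{(\infty)}_{u\cdot} &= (D^{-1}A)_{uu} X^{(\infty)}_{u\cdot} + (D^{-1}A)_{u\ell} X^{(\infty)}_{\ell\cdot}, \\ X^{(\infty)}_{\ell\cdot} &= S_{\ell\cdot}, \end{aligned}$$

whose solutions are

$$\begin{aligned} X^{(\infty)}_{u\cdot} &= \left(I_{uu} - (D^{-1}A)_{uu}\right)^{-1} (D^{-1}A)_{u\ell}\, S_{\ell\cdot}, \\ X^{(\infty)}_{\ell\cdot} &= S_{\ell\cdot}, \end{aligned}$$

which is the same expression as the Label Propagation solution (5.4).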
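The propagation scheme can be checked numerically on a small example. The sketch below is an illustration under the same assumptions as the text (no isolated nodes, every unlabelled node connected to a labelled one); the function name `propagate` and the fixed number of steps are choices made for the example rather than a stopping rule from the source.

```python
import numpy as np

def propagate(A, S, n_steps=200):
    """Iterative Label Propagation: average over neighbours, clamp labelled rows."""
    labelled = S.sum(axis=1) > 0
    P = A / A.sum(axis=1, keepdims=True)   # D^{-1} A
    X = S.astype(float)                    # initial condition X^(0) = S
    for _ in range(n_steps):
        X = P @ X                          # local averaging at every node
        X[labelled] = S[labelled]          # labelled nodes keep the oracle value
    return X

# same path graph 0 - 1 - 2 - 3 with nodes 0 and 3 labelled
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1, 0], [0, 0], [0, 0], [0, 1]])
X_iter = propagate(A, S)
# the iterates approach the closed-form solution: X_iter[1] tends to (2/3, 1/3)
```

Each step only uses the values held by the neighbours of a node, which is what makes the scheme suitable for a decentralized implementation.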


Random walk interpretation. Let $y_1, y_2, \dots$ be a random walk on the graph,
where the walker jumps from node $i$ to a node $j$, where $j$ is a neighbour of $i$ chosen
uniformly at random. The transition probabilities are given by

$$p_{ij} = \mathbb{P}(y_{t+1} = j \mid y_t = i) = \frac{A_{ij}}{d_i},$$

where $d_i$ is the degree of node $i$. In particular, $p_{ij} = (D^{-1}A)_{ij}$ and we note that
$P = D^{-1}A$ is the matrix of transition probabilities.

Suppose that the walk starts from node $i$, and that we end the walk as soon as we
hit a labelled node. We denote by $y_{\mathrm{end}}$ the final node, and by $\hat{X}_{ik}$ the probability
that $S_{y_{\mathrm{end}} k} = 1$, that is, the probability that the oracle assigned $y_{\mathrm{end}}$ to community $k$.
Thus, we have

$$\hat{X}_{ik} = \mathbb{P}\left(S_{y_{\mathrm{end}} k} = 1 \mid y_1 = i\right).$$

In particular, if $i \in \ell$, then $y_{\mathrm{end}} = i$ and

$$\hat{X}_{ik} = \begin{cases} 1, & \text{if } S_{ik} = 1, \\ 0, & \text{otherwise,} \end{cases}$$

which is equivalent to $\hat{X}_{\ell\cdot} = S_{\ell\cdot}$. By the Markov property, we also have for any
unlabelled node $i$

$$\mathbb{P}\left(S_{y_{\mathrm{end}} k} = 1 \mid y_1 = i\right) = \sum_{j=1}^{n} \mathbb{P}\left(S_{y_{\mathrm{end}} k} = 1 \mid y_2 = j\right) p_{ij},$$

and therefore

$$\hat{X}_{u\cdot} = (P \hat{X})_{u\cdot}.$$

Writing this equation in the block form and combining it with the constraint
$\hat{X}_{\ell\cdot} = S_{\ell\cdot}$ obtained before, leads to

$$\hat{X}_{u\cdot} = \left(I_{uu} - (D^{-1}A)_{uu}\right)^{-1} (D^{-1}A)_{u\ell}\, S_{\ell\cdot}.$$

Hence, we again recover the same expression as the solution of Label Propagation (5.4).


Interpretation as a heat equation. Let us now interpret Label Propagation
as the solution of a heat equation. The evolution of the temperature $T$ of an isotropic
material is governed by the heat equation

$$\frac{\partial T}{\partial t} = \alpha \Delta T,$$

where $\Delta$ is the Laplacian and $\alpha$ is the thermal conductivity of the material. At
equilibrium, we simply have $\Delta T = 0$.

The oracle $S$ plays the role of a heat bath. More precisely, we first fix a $k \in [K]$.
The labelled nodes $i \in \ell$ behave as heat sources, whose temperature $T_{ik}$ remains
constant and equal to $S_{ik} \in \{0, 1\}$. The temperatures of the unlabelled nodes
vary, as heat exchange takes place along the graph edges and is proportional to
the temperature difference between the edges' endpoints. Therefore,

$$\begin{aligned} \forall i \in \ell : \quad & T_{ik} = S_{ik}, \\ \forall i \in u : \quad & \frac{\partial T_{ik}}{\partial t} = \sum_{j=1}^{n} A_{ij} (T_{jk} - T_{ik}). \end{aligned}$$

Since $\sum_{j=1}^{n} A_{ij} (T_{jk} - T_{ik}) = (A T_{\cdot k})_i - d_i T_{ik} = -(L T_{\cdot k})_i$, the temperature $T_{\cdot k}$
verifies at equilibrium

$$\forall i \in u : \quad (L T_{\cdot k})_i = 0,$$

while $T_{ik} = S_{ik}$ for any labelled node $i$. This can be rewritten as

$$\begin{aligned} (L T)_{u\cdot} &= 0, \\ T_{\ell\cdot} &= S_{\ell\cdot}. \end{aligned}$$

The above system is equivalent to equation (5.5), whose solution is equal to the
solution of Label Propagation (5.4) (see Lemma 5.2).


5.1.2 Label Spreading


The SSL method of Label Spreading (Zhou et al., 2004) is based on the optimisation
problem

$$\hat{X}^{LS} = \arg\min_{X \in \mathbb{R}^{n \times K}} C^{LS}(X),$$

where the cost function $C^{LS}$ is defined by

$$C^{LS}(X) = \mathrm{Tr}\left(X^T \mathcal{L} X + \lambda (X - S)^T (X - S)\right).$$

After simple linear algebra transformation, we have

$$C^{LS}(X) = \sum_{k=1}^{K} \left( \frac{1}{2} \sum_{i,j} A_{ij} \left( \frac{X_{ik}}{\sqrt{d_i}} - \frac{X_{jk}}{\sqrt{d_j}} \right)^2 + \lambda \sum_{i} (X_{ik} - S_{ik})^2 \right),$$

where $d_i$ denotes the degree of node $i$.

The parameter $\lambda$ enforces a trade-off between the smoothness of the solution
$\hat{X}^{LS}$ over the graph and the closeness of the solution to the oracle information $S$.
The difference with the Label Propagation method is that the smoothness of the
solution is imposed by the term $\mathrm{Tr}(X^T \mathcal{L} X)$, which now includes the normalization
by the node degrees.

As in the case of Label Propagation, $\hat{X}^{LS}$ can also be expressed in a closed form.
Namely, for each $k \in [K]$, we have

$$\frac{1}{2} \frac{\partial C^{LS}}{\partial X_{\cdot k}} = \mathcal{L} X_{\cdot k} + \lambda (X_{\cdot k} - S_{\cdot k}),$$

and hence

$$\hat{X}^{LS}_{\cdot k} = \lambda (\lambda I + \mathcal{L})^{-1} S_{\cdot k} = \frac{\lambda}{1 + \lambda} \left( I - \frac{1}{1 + \lambda} D^{-1/2} A D^{-1/2} \right)^{-1} S_{\cdot k}.$$

Therefore,

$$\hat{X}^{LS} = (1 - \alpha) \left( I - \alpha D^{-1/2} A D^{-1/2} \right)^{-1} S,$$

where $\alpha = \frac{1}{1 + \lambda} \in (0, 1)$. This gives Algorithm 9.


Algorithm 9: Label Spreading (Zhou et al., 2004).
Input: graph $G$, oracle $S$, parameter $\alpha \in (0, 1)$.
Output: node labelling $\hat{z} = (\hat{z}_1, \dots, \hat{z}_n) \in [K]^n$.
Process:
• compute the normalized adjacency matrix $\tilde{A} = D^{-1/2} A D^{-1/2}$;
• let $\hat{X}^{LS}$ be the solution of $(I - \alpha \tilde{A}) \hat{X}^{LS} = (1 - \alpha) S$;
• for $i \in [n]$, let $\hat{z}_i$ be defined by classification rule (5.2).

Return: $\hat{z}$.
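The steps of Algorithm 9 can be sketched as follows. This is a minimal dense NumPy illustration, assuming a symmetric adjacency matrix without isolated nodes; the function name `label_spreading`, the default value of `alpha` and the toy graph are assumptions made for the example.

```python
import numpy as np

def label_spreading(A, S, alpha=0.8):
    """Label Spreading: solve (I - alpha * A_norm) X = (1 - alpha) S."""
    d = A.sum(axis=1)
    A_norm = A / np.sqrt(np.outer(d, d))   # D^{-1/2} A D^{-1/2}
    n = A.shape[0]
    X = np.linalg.solve(np.eye(n) - alpha * A_norm, (1 - alpha) * S)
    return X, X.argmax(axis=1)             # classification rule (5.2), 0-based

# path graph 0 - 1 - 2 - 3; node 0 labelled as class 0, node 3 as class 1
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1, 0], [0, 0], [0, 0], [0, 1]])
X_ls, z_ls = label_spreading(A, S, alpha=0.8)
```

Unlike Label Propagation, the rows of the solution on the labelled nodes are not forced to equal the oracle values, which is what makes the method more tolerant to a noisy oracle.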


5.1.3 Generalized Laplacian


As a follow-up to Label Propagation and Label Spreading methods, Avrachenkov
et al., 2012 proposed a general class of cost functions

$$C(X) = \mathrm{Tr}\left( X^T D^{\sigma - 1} L D^{\sigma - 1} X + \lambda (X - S)^T D^{2\sigma - 1} (X - S) \right),$$

where $\lambda > 0$ and $0 \le \sigma \le 1$ are two hyper-parameters. The solution of the
minimisation problem

$$\hat{X} = \arg\min_{X \in \mathbb{R}^{n \times K}} C(X)$$

is given by

$$\hat{X} = (1 - \alpha) \left( I - \alpha D^{-\sigma} A D^{\sigma - 1} \right)^{-1} S,$$

where $\alpha = 1/(1 + \lambda)$, as for Label Spreading. Since the computations are similar to the
computations in the previous sections, we omit them and refer the reader to (Avrachenkov
et al., 2012, Proposition 2) for details. Different normalizations are obtained by different
choices of $\sigma$. In particular,

• $\sigma = 1$ corresponds to Label Propagation;
• $\sigma = 1/2$ corresponds to Label Spreading;
• $\sigma = 0$ corresponds to the PageRank-based method.
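The three special cases can be obtained from a single routine. The sketch below is an illustration only: the function name `generalized_laplacian_ssl`, the toy graph (two triangles joined by one edge) and the choice of labelled nodes are assumptions, and a dense solver is used for simplicity.

```python
import numpy as np

def generalized_laplacian_ssl(A, S, alpha=0.8, sigma=0.5):
    """Return (1 - alpha) (I - alpha D^{-sigma} A D^{sigma-1})^{-1} S."""
    d = A.sum(axis=1)
    # M_ij = d_i^{-sigma} * A_ij * d_j^{sigma - 1}
    M = (d ** -sigma)[:, None] * A * (d ** (sigma - 1))[None, :]
    n = A.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * M, S)

# two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
S = np.zeros((6, 2))
S[0, 0] = 1.0   # node 0 labelled as class 0
S[5, 1] = 1.0   # node 5 labelled as class 1

X_pagerank = generalized_laplacian_ssl(A, S, sigma=0.0)   # PageRank-based
X_spreading = generalized_laplacian_ssl(A, S, sigma=0.5)  # Label Spreading
X_standard = generalized_laplacian_ssl(A, S, sigma=1.0)   # standard Laplacian
```

For $\sigma = 0$ the matrix $A D^{-1}$ is column-stochastic, so each column of the solution has the same total mass as the corresponding column of $S$, as in a personalised PageRank vector.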


5.1.4 Numerical Performance of the Laplacian-based Methods 


Choice of hyper-parameter $\alpha$ for Label Spreading. We first investigate the
effect of $\alpha$ on the classification performance. We choose two datasets for which
we saw that unsupervised spectral clustering failed: DBLP and Cora. We let 2%
of the nodes be labelled by the oracle, and we plot in Figure 5.1 (blue curve) the
accuracy as a function of $\alpha$. We see that the accuracy increases with $\alpha$,
but drops suddenly when $\alpha$ becomes too close to one. We also notice that the best
choice of $\alpha$ can be made by looking at the modularity of the predicted partition
(red curve in Figure 5.1), as the modularity closely follows the accuracy.


Noisy oracle. We now study the effect of noise on classification performance. We
keep the same two datasets, now with 5% labelled nodes. We define the noise as
the proportion of mistakes made by the oracle. The results are plotted in
Figure 5.2. As expected, the noise deteriorates the classification performance.


(a) DBLP dataset. (b) Cora dataset.


Figure 5.1. Effect of the choice of parameter $\alpha$ on the performance of Label Spreading on
two data sets. The blue curve gives the accuracy (computed with respect to the ground
truth labels) and the red curve the modularity (computed using only the observed graph
and the predicted labels). Results are averaged over 100 realisations. In each realisation,
we randomly chose 2% of the nodes to serve as labelled nodes.

Figure 5.2. Effect of a noisy oracle on the classification performance of Laplacian-based
methods. Results are averaged over 100 realisations, with 5% of the nodes being labelled.
(We choose $\alpha = 0.8$ for Label Spreading and Generalized Laplacian, and $\sigma = 0$ for
Generalized Laplacian, which corresponds to the PageRank-based method.)

Figure 5.3. Effect of small labelled data on the classification performance of
Laplacian-based methods. Results are averaged over 100 realisations.


Small amount of labelled data. We finish this section by emphasizing the
importance of the problem of a small amount of labelled data. Figure 5.3 shows that
the classification accuracy degrades heavily when the number of labelled nodes per
class becomes too low.


5.2 Learning with Small Amount of Labelled Data 


5.2.1 The Problem of Small Labelled Data


Numerical experiments showed that at very low labelling rates, the performance
of the SSL methods becomes poor. We will explain this phenomenon using the
random walk interpretation of the Label Propagation algorithm (see Section 5.1.1 for
more details on Label Propagation).

Let $y_0, y_1, y_2, \dots$ be a random walk on the graph starting at node $i$. Let
$\tau = \inf\{t \ge 1 : y_t \in \ell\}$ be the first time that the walk hits a labelled node, and we
recall that $\hat{X}_{ik} = \mathbb{P}(S_{y_\tau k} = 1 \mid y_0 = i)$. In other words, $\hat{X}_{ik}$ is the probability that
the first labelled node reached by the walk (started from node $i$) has label $k$.

If the number of labelled nodes is small and the graph is large, then the time $\tau$
will be large. In particular, if $\tau$ is larger than the mixing time of the walk, then the
distribution of $y_\tau$ is very close to the invariant distribution $\pi$ of the random walk,
given by

$$\pi_j = \frac{d_j}{\sum_{i=1}^{n} d_i}.$$

This means that the chain has forgotten its starting point $i$, and thus $\hat{X}^{LP}_{ik}$ is
approximately a constant independent of $i$, which defeats the goal of classification.
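Let us formalize this intuition and try to mitigate the problem. Fix a class $k$ and
write $\hat{X}_j$ for $\hat{X}_{jk}$. We first note that, for $1 \le t \le \tau$,

$$\mathbb{E}\left[\hat{X}_{y_t} - \hat{X}_{y_{t-1}} \mid y_{t-1}\right] = -\frac{1}{d_{y_{t-1}}} (L \hat{X})_{y_{t-1}} = 0,$$

since $L \hat{X} = 0$ on the unlabelled nodes (see equation (5.5)). Thus, the stopped
sequence $\hat{X}_{y_0}, \hat{X}_{y_1}, \dots, \hat{X}_{y_\tau}$ is a martingale. Since $\tau$ is an almost surely finite
stopping time and the martingale is bounded, Doob's optional stopping theorem then
implies that

$$\mathbb{E}\left[\hat{X}_{y_\tau}\right] = \mathbb{E}\left[\hat{X}_{y_0}\right].$$

Since $y_0 = i$ and $y_\tau \in \ell$, we have $\mathbb{E}[\hat{X}_{y_0}] = \hat{X}_{ik}$ and $\hat{X}_{y_\tau} = S_{y_\tau k}$, and thus

$$\hat{X}_{ik} = \mathbb{E}\left[S_{y_\tau k}\right] \approx \frac{\sum_{j \in \ell} d_j S_{jk}}{\sum_{j \in \ell} d_j}. \qquad (5.6)$$

Hence, the first order approximation of $\hat{X}_{ik}$ is the same for all unlabelled nodes $i$,
and potential differences only come from second-order terms. A first improvement
of Label Propagation is thus to replace the classification rule (5.2) by

$$\hat{z}_i = \arg\max_{k \in \{1, \dots, K\}} \left(\hat{X}_{ik} - c_k\right),$$

where $c_k = \frac{\sum_{j \in \ell} d_j S_{jk}}{\sum_{j \in \ell} d_j}$. Equivalently, one could "shift" equation (5.5) and solve

$$(L X)_{ik} = \begin{cases} S_{ik} - c_k, & \text{if } i \in \ell, \\ 0, & \text{otherwise.} \end{cases}$$

Rewriting the above equation as

$$(L X)_{ik} = \sum_{j \in \ell} (S_{jk} - c_k)\, \delta_{ij},$$

we can interpret it as a heat equation, where heat sources and sinks are placed at
the labelled nodes.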


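5.2.2 Poisson Learning


Let $\bar{S}_k = \frac{1}{|\ell|} \sum_{j \in \ell} S_{jk}$. Following the previous remarks, Calder et al., 2020 proposed to
consider the equation

$$(L X)_{ik} = \sum_{j \in \ell} (S_{jk} - \bar{S}_k)\, \delta_{ij} \qquad (5.7)$$

such that $\sum_{i=1}^{n} d_i X_{ik} = 0$. Equivalently, this amounts to solving the following
optimisation problem (Calder et al., 2020, Theorem 2.3)

$$\arg\min_{\substack{X \in \mathbb{R}^{n \times K} \\ \sum_{i} d_i X_{i\cdot} = 0}} \ \frac{1}{2} \mathrm{Tr}\left(X^T L X\right) - \mathrm{Tr}\left((S - \tilde{S})^T X\right),$$

where

$$\tilde{S}_{ik} = \begin{cases} \bar{S}_k, & \text{if } i \in \ell, \\ 0, & \text{otherwise.} \end{cases}$$

In particular, while Label Propagation handles labelled data by placing hard
constraints, Poisson learning adds a loss term to the energy function.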

Random walk interpretation. As the labelled nodes are now sources and sinks
of the heat equation, the random walk interpretation differs. Let us denote by
$y^j_0, y^j_1, y^j_2, \dots$ a random walk on the graph starting from node $j \in \ell$. Each time
the random walk hits node $i$, we record the shifted label $S_{j\cdot} - \bar{S}$, where
$\bar{S} = (\bar{S}_1, \dots, \bar{S}_K)$. This defines the quantity

$$X^{(T)}_{i\cdot} = \mathbb{E}\left[ \sum_{t=0}^{T} \frac{1}{d_i} \sum_{j \in \ell} (S_{j\cdot} - \bar{S})\, 1(y^j_t = i) \right].$$

The following lemma gives an iterative expression for $X^{(T)}$.

Lemma 5.3. We have

$$X^{(T+1)}_{i\cdot} = X^{(T)}_{i\cdot} + \frac{1}{d_i} \left( \sum_{j \in \ell} (S_{j\cdot} - \bar{S})\, \delta_{ij} - (L X^{(T)})_{i\cdot} \right).$$

Furthermore, assume $G$ is connected and the Markov chain induced by the random
walk is aperiodic. Then $\lim_{T \to \infty} X^{(T)} = X$, where $X$ is the unique solution of the
Poisson equation (5.7).


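Proof. We first write

$$X^{(T)}_{ik} = \sum_{j \in \ell} (S_{jk} - \bar{S}_k)\, G_T(i, j), \qquad (5.8)$$

where $G_T(i, j) = \frac{1}{d_i} \mathbb{E}\left[\sum_{t=0}^{T} 1(y^j_t = i)\right] = \frac{1}{d_i} \sum_{t=0}^{T} \mathbb{P}(y^j_t = i)$ is the normalized
Green function. Using

$$\mathbb{P}(y^j_t = i) = \sum_{u=1}^{n} \mathbb{P}(y^j_t = i \mid y^j_{t-1} = u)\, \mathbb{P}(y^j_{t-1} = u),$$

we have

$$\begin{aligned} d_i\, G_T(i, j) &= \delta_{ij} + \sum_{t=1}^{T} \sum_{u=1}^{n} \frac{A_{ui}}{d_u}\, \mathbb{P}(y^j_{t-1} = u) \\ &= \delta_{ij} + \sum_{u=1}^{n} \frac{A_{ui}}{d_u} \sum_{t=0}^{T-1} \mathbb{P}(y^j_t = u) \\ &= \delta_{ij} + \sum_{u=1}^{n} A_{ui}\, G_{T-1}(u, j), \end{aligned}$$

and therefore

$$d_i \left( G_T(i, j) - G_{T-1}(i, j) \right) + \left(L\, G_{T-1}(\cdot, j)\right)_i = \delta_{ij}.$$

Combined with equation (5.8), this establishes

$$d_i \left( X^{(T)}_{ik} - X^{(T-1)}_{ik} \right) = \sum_{j \in \ell} (S_{jk} - \bar{S}_k)\, \delta_{ij} - (L X^{(T-1)})_{ik}.$$

Then, summing both sides of this equation over $i = 1, \dots, n$ leads to

$$\sum_{i=1}^{n} d_i X^{(T)}_{ik} = \sum_{i=1}^{n} d_i X^{(T-1)}_{ik},$$

and therefore $\sum_{i=1}^{n} d_i X^{(T)}_{ik} = \sum_{i=1}^{n} d_i X^{(0)}_{ik}$ for all $T$. Since

$$d_i X^{(0)}_{ik} = \sum_{j \in \ell} (S_{jk} - \bar{S}_k)\, \delta_{ij},$$

we obtain $\sum_{i=1}^{n} d_i X^{(T)}_{ik} = 0$. Finally, let $v^{(T)}_{ik} = d_i (X^{(T)}_{ik} - X_{ik})$. We have

$$v^{(T)}_{\cdot k} = A D^{-1} v^{(T-1)}_{\cdot k},$$

and $\sum_{i=1}^{n} v^{(T)}_{ik} = 0$ for all $T$. Since the random walk is aperiodic and the graph
is connected,

$$\lim_{T \to \infty} v^{(T)}_{ik} = \pi_i \sum_{j=1}^{n} v^{(0)}_{jk} = 0,$$

where $\pi_i = \frac{d_i}{\sum_{j=1}^{n} d_j}$ is the chain's stationary distribution.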
In fact, Lemma 5.3 also provides an iterative numerical procedure for computing
the solution. The procedure is formally described in Algorithm 10.


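Algorithm 10: Poisson learning (Calder et al., 2020).
Input: graph $G$, oracle $S \in \{0, 1\}^{n \times K}$, number of iterations $T$.
Output: node labelling $\hat{z} \in [K]^n$.
Process:
• let $L$ be the graph's standard Laplacian and $D$ the graph's degree matrix;
• let $\ell$ be the set of labelled nodes, and let $\tilde{S} \in \mathbb{R}^{n \times K}$ be defined by
  $\tilde{S}_{ik} = \bar{S}_k$ if $i \in \ell$ and $\tilde{S}_{ik} = 0$ otherwise, where $\bar{S}_k = \frac{1}{|\ell|} \sum_{i \in \ell} S_{ik}$;
• let $X = 0_{n \times K}$;
• for $t = 1, \dots, T$ do: $X \leftarrow X + D^{-1} (S - \tilde{S} - L X)$;
• for $i = 1, \dots, n$ let $\hat{z}_i$ be defined by classification rule (5.2).

Return: $\hat{z}$.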
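The iteration of Algorithm 10 can be sketched in a few lines. The code below is a minimal dense NumPy illustration, not the reference implementation of Calder et al.: the function name `poisson_learning`, the zero initialisation and the toy graph (two triangles joined by one edge, which is connected and not bipartite, so the walk is aperiodic) are assumptions made for the example.

```python
import numpy as np

def poisson_learning(A, S, T=200):
    """Iterate X <- X + D^{-1} (S - S_tilde - L X), starting from X = 0."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    labelled = S.sum(axis=1) > 0
    S_bar = S[labelled].mean(axis=0)     # average label over the labelled nodes
    B = np.zeros(S.shape)
    B[labelled] = S[labelled] - S_bar    # source term S - S_tilde
    X = np.zeros(S.shape)
    for _ in range(T):
        X = X + (B - L @ X) / d[:, None]
    return X, X.argmax(axis=1)           # classification rule (5.2), 0-based

# two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
S = np.zeros((6, 2))
S[0, 0] = 1.0   # one labelled node in the first triangle
S[5, 1] = 1.0   # one labelled node in the second triangle
X_pl, z_pl = poisson_learning(A, S)
# z_pl is [0, 0, 0, 1, 1, 1]: each triangle inherits the label of its single
# labelled node, and the degree-weighted mean of X_pl stays at zero.
```

On a bipartite graph the plain iteration would oscillate rather than converge, which is why the aperiodicity assumption of Lemma 5.3 matters in practice.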


5.2.3 Numerical Experiments 


To assess the performance of Poisson learning in a regime with an extremely low
number of labelled nodes, we reproduce the results of Calder et al., 2020. They
consider the MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017)
datasets, on which they trained auto-encoders to extract important features from
the data. More precisely, they used variational auto-encoders with 3 fully
connected layers of sizes (784,400,20) and (784,400,30), respectively, followed by a
symmetrically defined decoder (Kingma and Welling, 2014). The auto-encoder
was trained for 100 epochs on each data set. Then, a 10-nearest neighbours graph
is built with Gaussian weights $w_{ij} = \exp\left(-4 \|x_i - x_j\|^2 / \sigma_i^2\right)$, where $x_i$ denotes the
latent variables of image $i$ and $\sigma_i$ is the distance between $x_i$ and its 10th nearest
neighbour. The results are shown in Figure 5.4. In particular, even with only one
labelled node per class, the performance of Poisson learning remains extremely
high.


Figure 5.4. Performance of Poisson learning on MNIST and Fashion-MNIST datasets,
when the number of labelled nodes per class is extremely small. Results are averaged
over 10 realisations, and error bars show the standard error.


5.3 Other Methods 


5.3.1 Constrained Spectral Clustering 


The goal of this method is to directly incorporate semi-supervised information into
spectral methods. Specifically, from the oracle information $S$, we can construct the
must-link/cannot-link matrix $Q$ as follows:

$$Q_{ij} = Q_{ji} = \begin{cases} 1, & \text{if } i, j \in \ell \text{ and } S_{i\cdot} = S_{j\cdot}, \\ -1, & \text{if } i, j \in \ell \text{ and } S_{i\cdot} \neq S_{j\cdot}, \\ 0, & \text{otherwise.} \end{cases}$$

We note that must-link/cannot-link information can also be given to us directly. In
fact, for experts it is often easier to conclude if two items are similar or not, rather
than to attribute items to classes.

For a membership matrix $Z \in \mathcal{Z}_{n,K}$, the quantity

$$\mathrm{Tr}\left(Z^T Q Z\right) = \sum_{k=1}^{K} \sum_{i,j} Q_{ij} Z_{ik} Z_{jk}$$

measures how well the membership matrix $Z$ respects the oracle information.
Indeed, this quantity increases by $1$ if $Q_{ij} = 1$ and $Z$ assigns nodes $i, j$ to the same
cluster, and decreases by $1$ if $Q_{ij} = -1$ but $i$ and $j$ are assigned to the same cluster.
Therefore, $\mathrm{Tr}(Z^T Q Z)$ counts the must-link pairs placed in the same cluster minus
the cannot-link pairs placed in the same cluster. Rather than asking for all the
constraints in $Q$ to be verified, we can impose the lower bound

$$\mathrm{Tr}\left(Z^T Q Z\right) \ge \alpha$$

for some $\alpha > 0$. Such a constraint can be incorporated directly into the normalized
spectral clustering minimisation problem, and it leads to the following optimisation
problem

$$\arg\min_{\substack{U \in \mathbb{R}^{n \times K} \\ U^T D U = I_K \\ \mathrm{Tr}(U^T Q U) \ge \alpha}} \mathrm{Tr}\left(U^T L U\right),$$

which after the change of variable $X = D^{1/2} U$ can be rewritten as

$$\arg\min_{\substack{X \in \mathbb{R}^{n \times K} \\ X^T X = I_K \\ \mathrm{Tr}(X^T \tilde{Q} X) \ge \alpha}} \mathrm{Tr}\left(X^T \mathcal{L} X\right), \qquad (5.9)$$

with $\tilde{Q} = D^{-1/2} Q D^{-1/2}$.
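To make the must-link/cannot-link matrix concrete, the sketch below builds $Q$ from an oracle matrix and evaluates $\mathrm{Tr}(Z^T Q Z)$ for a candidate partition. It is an illustration only: the function name `must_cannot_link` and the toy data are assumptions, and the sum runs over all ordered pairs $(i, j)$, diagonal included, as in the definition of $Q$.

```python
import numpy as np

def must_cannot_link(S):
    """Build Q: +1 for same oracle label, -1 for different labels, 0 if unlabelled."""
    labelled = S.sum(axis=1) > 0
    same = S @ S.T                        # 1 iff both nodes carry the same label
    both = np.outer(labelled, labelled)   # True iff both nodes are labelled
    return np.where(both, np.where(same > 0, 1, -1), 0)

# nodes 0 and 1 labelled as class 0, node 2 as class 1, node 3 unlabelled
S = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
Q = must_cannot_link(S)

# candidate partition: clusters {0, 1} and {2, 3}
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
score = np.trace(Z.T @ Q @ Z)
# cluster {0, 1} contributes Q00 + Q01 + Q10 + Q11 = 4, cluster {2, 3}
# contributes Q22 = 1, so score equals 5
```

A partition that merged nodes 0 and 2 would pick up the entries $Q_{02} = Q_{20} = -1$ and obtain a lower score.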


Lemma 5.4. Let $X$ be a solution of (5.9). The columns of $X$ are solutions of a generalized
eigenvalue problem $\mathcal{L} X_{\cdot k} = \lambda (\tilde{Q} - \beta I_n) X_{\cdot k}$ for some $\beta$.


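Proof. The Lagrangian of the minimisation problem (5.9) is

$$\mathrm{Tr}\left(X^T \mathcal{L} X\right) - \lambda \left( \mathrm{Tr}\left(X^T \tilde{Q} X\right) - \alpha \right) - \mathrm{Tr}\left( \Gamma \left(X^T X - I_K\right) \right),$$

where $\lambda \in \mathbb{R}$ is the Lagrange multiplier associated to the constraint
$\mathrm{Tr}(X^T \tilde{Q} X) \ge \alpha$ and $\Gamma \in \mathbb{R}^{K \times K}$ is a symmetric matrix whose elements are the
multipliers associated to the constraint $X^T X = I_K$. Note that up to a change of basis,
we can choose $\Gamma$ to be diagonal. Then, according to the KKT Theorem (Kuhn, 1982),
any feasible optimal solution of problem (5.9) must verify

stationarity: $\mathcal{L} X - \lambda \tilde{Q} X - X \Gamma = 0$,
primal feasibility: $\mathrm{Tr}(X^T \tilde{Q} X) \ge \alpha$ and $X^T X = I_K$,
dual feasibility: $\lambda \ge 0$,
complementary slackness: $\lambda \left( \mathrm{Tr}(X^T \tilde{Q} X) - \alpha \right) = 0$.

The complementary slackness requirement implies either $\lambda = 0$ or
$\mathrm{Tr}(X^T \tilde{Q} X) = \alpha$. If $\lambda = 0$, then the stationarity requirement would reduce the
problem to the standard (unconstrained) spectral clustering. Thus $\lambda \neq 0$ and
$\mathrm{Tr}(X^T \tilde{Q} X) = \alpha$, and the KKT conditions become

$$\begin{aligned} \mathcal{L} X - \lambda \tilde{Q} X - X \Gamma &= 0, \\ \mathrm{Tr}\left(X^T \tilde{Q} X\right) &= \alpha, \\ X^T X &= I_K, \\ \lambda &> 0. \end{aligned}$$

Since $\Gamma$ is diagonal, the first equation is equivalent to

$$\left(\mathcal{L} - \Gamma_{kk} I_n\right) X_{\cdot k} = \lambda \tilde{Q} X_{\cdot k},$$

which is a generalised eigenvalue problem for a given $\Gamma_{kk}$. We end the proof by
letting $\beta = -\Gamma_{kk} / \lambda$.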


Based on Lemma 5.4, and following Wang and Davidson, 2010 and Wang et al.,
2014, we propose the following procedure:

(i) find the vectors $v_1, \dots, v_p$ solutions of $\mathcal{L} v_k = \lambda_k (\tilde{Q} - \beta I_n) v_k$ associated
to $\lambda_k > 0$;
(ii) given all the feasible eigenvectors $v_1, \dots, v_p$, pick the top $K - 1$ in terms
of minimising $v^T \mathcal{L} v$, and let those $K - 1$ vectors form the columns of $X$.

Since $\beta < \lambda_{K-1}(\tilde{Q})$, there are at least $K - 1$ solutions of the generalised eigenvalue
problem associated with positive eigenvalues. Furthermore, the solutions are real
vectors, since $\mathcal{L}$ and $\tilde{Q} - \beta I_n$ are Hermitian matrices. Finally, this procedure is
justified, since $X$ verifies the KKT conditions derived in the proof of Lemma 5.4 if
$(K - 1) \beta < \alpha$.
Indeed, since £ is positive semi-definite, we have v? Lv > 0 with equality only 
for v x 1,. Hence Tr (X7 LX) > 0. Moreover, Tr (X7LX) = ie Le = 
dp 24X (Q — Bln) X > Tr (X7QX)—-(K—-1)f, and Tr B < a. We summarise 
the procedure in Algorithm 11.

Algorithm 11: Constrained spectral clustering (Wang and Davidson, 2010;
Wang et al., 2014).
Input: graph $G$, must-link/cannot-link matrix $Q$, number of clusters $K$,
parameter $\beta$.
Output: node labelling $\hat{z} \in [K]^n$.
Process: let $\mathcal{L}$ be the normalised Laplacian of $G$, and let
$\tilde{Q} = D^{-1/2} Q D^{-1/2}$.
if $\beta \ge \lambda_{K-1}(\tilde{Q})$ then
    Return: $\emptyset$.
else
    • let $v_1, \dots, v_p$ be the solutions of the generalised eigenvalue problem
      $\mathcal{L} v = \lambda (\tilde{Q} - \beta I_n) v$ associated with eigenvalues $\lambda > 0$;
    • let $V^* = \arg\min_{V \in \mathbb{R}^{n \times (K-1)}} \mathrm{Tr}(V^T \mathcal{L} V)$, where the columns of $V$ are a
      subset of the feasible eigenvectors computed previously.
    Return: $\hat{z} = \text{k-means}(D^{-1/2} V^*, K)$.


5.3.2 Laplacian Regularization 


Previous methods minimise a cost function, which involves a smoothness term
$\mathrm{Tr}(X^T M X)$, where $M$ is typically the graph (standard or normalised) Laplacian,
a penalty term penalising any differences between $X_{\ell\cdot}$ and $S_{\ell\cdot}$, and possibly a
regularisation term.

In contrast, Laplacian regularization (Belkin and Niyogi, 2002) enforces the
smoothness by constraining the columns of $X$ to belong to the subspace spanned by the
eigenvectors of the graph Laplacian $L$ associated to the $p$ smallest eigenvalues. It
then finds the linear combination of these eigenvectors that minimises the
mean-squared error between $X$ and $S$ on the labelled nodes.

Let $v_1, \dots, v_p$ be the eigenvectors of $L$ associated to the $p$ smallest eigenvalues,
normalized so that $\|v_q\|_2 = 1$. The solution $\hat{X} = (\hat{X}_{ik})_{i \in [n], k \in [K]}$ is written as
$\hat{X}_{ik} = \sum_{q=1}^{p} b_{qk} v_q(i)$, where $v_q(i)$ stands for the $i$-th entry of the eigenvector $v_q$.
In matrix form, this gives $\hat{X} = V B$ where $V = (v_1, \dots, v_p) \in \mathbb{R}^{n \times p}$ and $B \in \mathbb{R}^{p \times K}$.
The mean-squared error between the labelled nodes and their corresponding oracle
value is then

$$\mathrm{MSE}\left(\hat{X}_{\ell\cdot}, S_{\ell\cdot}\right) = \sum_{k=1}^{K} \sum_{i \in \ell} \left(\hat{X}_{ik} - S_{ik}\right)^2 = \sum_{k=1}^{K} \sum_{i \in \ell} \left( \sum_{q=1}^{p} b_{qk} v_q(i) - S_{ik} \right)^2.$$

Let $b = B_{\cdot k}$ and $s = S_{\cdot k}$ be the $k$-th columns of $B$ and $S$, respectively. The solution
to the least square problem

$$\arg\min_{b \in \mathbb{R}^p} \sum_{i \in \ell} \left( \sum_{q=1}^{p} b_q v_q(i) - s_i \right)^2$$

is given by $\hat{b} = \left(V_{\ell\cdot}^T V_{\ell\cdot}\right)^{-1} V_{\ell\cdot}^T s_\ell$. Therefore, the solution $\hat{X}^{LR}$ minimising the
mean-squared error is

$$\hat{X}^{LR} = V \left(V_{\ell\cdot}^T V_{\ell\cdot}\right)^{-1} V_{\ell\cdot}^T S_{\ell\cdot}.$$

This is summarised in Algorithm 12.


Algorithm 12: Laplacian regularization (Belkin and Niyogi, 2002).
Input: graph $G$, oracle $S$, number of eigenvectors $p$.
Output: node labelling $\hat{z} \in [K]^n$.
Process:
• compute $v_1, \dots, v_p$, the orthonormal eigenvectors associated to the $p$ smallest
  eigenvalues of the graph Laplacian $L = D - A$;
• let $V = (v_1, \dots, v_p) \in \mathbb{R}^{n \times p}$;
• let $\hat{X}^{LR} = V B^{LR}$ where $B^{LR}$ is the solution of $\left(V_{\ell\cdot}^T V_{\ell\cdot}\right) B^{LR} = V_{\ell\cdot}^T S_{\ell\cdot}$;
• for $i = 1, \dots, n$ let $\hat{z}_i$ be defined using the classification rule (5.2) on $\hat{X}^{LR}$.

Return: $\hat{z}$.
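The steps of Algorithm 12 can be sketched as follows. This is a small dense illustration, assuming that $p$ does not exceed the number of labelled nodes so that the least-squares problem is well posed; the function name `laplacian_regularization` and the toy graph are assumptions made for the example.

```python
import numpy as np

def laplacian_regularization(A, S, p):
    """Least-squares fit of the labels in the span of the p smoothest eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    V = vecs[:, :p]                       # orthonormal eigenvectors v_1, ..., v_p
    labelled = S.sum(axis=1) > 0
    # least-squares solution of V_l B = S_l, i.e. the normal equations of the text
    B, *_ = np.linalg.lstsq(V[labelled], S[labelled].astype(float), rcond=None)
    X = V @ B
    return X, X.argmax(axis=1)            # classification rule (5.2), 0-based

# two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
S = np.zeros((6, 2), dtype=int)
S[0, 0] = 1
S[5, 1] = 1
X_lr, z_lr = laplacian_regularization(A, S, p=2)
```

With $p = 2$ the fit uses the constant eigenvector and the second eigenvector of $L$, whose sign separates the two triangles, so the two labels are enough to classify all six nodes.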


5.3.3 $\ell^1$-based Methods: Sparse Label Propagation


Previous methods consist in minimising a cost function based on the $\ell^2$-norm.
Instead, Jung et al., 2019 proposed to measure the smoothness of a signal $x \in \mathbb{R}^n$
on a graph via its total variation

$$\|x\|_{TV} = \sum_{i,j} a_{ij} |x_i - x_j|.$$

If we let $z^0 \in [K]^n$ be the community labels, and $\ell$ be the set of labelled nodes,
then one can state the following optimisation problem

$$\hat{x} = \arg\min_{\substack{x \in \mathbb{R}^n \\ \forall i \in \ell :\ x_i = z^0_i}} \sum_{i,j} a_{ij} |x_i - x_j|. \qquad (5.10)$$

We can recover the predicted communities $\hat{z}$ by truncating $\hat{x} \in \mathbb{R}^n$ to $\hat{z} \in [K]^n$.

We recall that the standard Label Propagation (5.3) consists in minimising
$x^T L x = \frac{1}{2} \sum_{i,j} a_{ij} (x_i - x_j)^2$ under the oracle constraints. Hence, problem (5.10)
resembles Label Propagation, except that it involves the $\ell^1$-norm of signal
differences along the graph edges. Therefore, we expect it to accurately learn signals
which vary abruptly over few edges (which is indeed the case of community labels).
In contrast, $\ell^2$-norm based methods like Label Propagation might smooth out
such abrupt variations.
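The difference between the two smoothness measures can be seen on a toy example. The sketch below compares a step signal with a linear ramp on a path graph; the function names and the example signals are illustrative assumptions, and the total variation sums over ordered pairs $(i, j)$ as in the display above.

```python
import numpy as np

def total_variation(A, x):
    """Sum over ordered pairs (i, j) of a_ij * |x_i - x_j|."""
    return np.sum(A * np.abs(x[:, None] - x[None, :]))

def quadratic_energy(A, x):
    """x^T L x with L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return x @ L @ x

# path graph on 5 nodes
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

step = np.array([0.0, 0.0, 0.0, 1.0, 1.0])     # abrupt change over one edge
ramp = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # same endpoints, spread out

tv_step, tv_ramp = total_variation(A, step), total_variation(A, ramp)
q_step, q_ramp = quadratic_energy(A, step), quadratic_energy(A, ramp)
# tv_step == tv_ramp == 2.0, while q_step == 1.0 and q_ramp == 0.25
```

The total variation does not distinguish the two signals, whereas the quadratic energy is four times smaller for the ramp: an $\ell^2$-based method is pushed towards spreading the jump over many edges, while an $\ell^1$-based method is not.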

Finally, since the optimisation problem (5.10) involves a non-differentiable
function, its theoretical analysis is harder and some popular methods, such as
gradient-based ones, are ruled out. We refer the reader to Jung et al., 2019 for the
theoretical analysis and for the details of algorithmic implementation, and we simply
state Algorithm 13.


Algorithm 13: Sparse Label Propagation (Jung et al., 2019). 
Input: graph $G = (V, E)$, labelled set $\ell$, initial labels $(z^0_i)_{i \in \ell}$, and number of
iterations $n_{\text{iterations}}$.
Output: predicted node labelling $\hat{z} \in [K]^n$.
Eoo _ O — z 30) —~g (0) — = 
Initialize: let k = 0, 27 = ze, 2°? = On, 9” = Ons Yi Fay Aij 
1 E [7] and Aaj) = F for (i) E E. Define i diag(y h 
A = diag (AG) 
G. 


for 


GEE and T € {0, 1}!4!*” the incidence matrix of 


Update: while < iterations do 
glktl) — z% — Th; 


ae) = 29 forallie £; 
y=y+ od (224+) i, 2); 
MG) = median £! all - (ij) € E; 
= k+1 
a= (\- any + BT arn 
Lk=k+1. 
Return: 2. 


5.4 Bayesian Approach to SSL and Its Theoretical 
Analysis 


This section studies theoretical properties of Bayesian estimators for DC-SBM
graphs in the SSL setting. For simplicity of exposition, we mostly consider the
case of $K = 2$ clusters. The prior latent block structure is given by a random vector
$z^0 = (z_1^0, \dots, z_n^0)$ with $z_i^0 \sim \mathrm{Uni}(\{-1, 1\})$. The oracle is then represented as a
vector $s \in \{0, -1, 1\}^n$, whose entries $s_i$ are independent and distributed as follows:

\[
s_i =
\begin{cases}
z_i^0, & \text{with probability } \eta_1, \\
-z_i^0, & \text{with probability } \eta_0, \\
0, & \text{otherwise.}
\end{cases}
\tag{5.11}
\]
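
The oracle model (5.11) is easy to simulate. The following is a minimal sketch, assuming that $\eta_1$ is the probability of a correct label and $\eta_0$ the probability of a wrong one; the function name `sample_oracle` and its arguments are illustrative and do not come from the source.

```python
import random


def sample_oracle(z0, eta1, eta0, rng=None):
    """Draw an oracle vector s from the true labels z0 (entries in {-1, +1}).

    Assumption: eta1 is the probability that the oracle reveals the correct
    label and eta0 the probability that it reveals the wrong one, so a node
    stays unlabelled (s_i = 0) with probability 1 - eta1 - eta0.
    """
    if eta1 < 0 or eta0 < 0 or eta1 + eta0 > 1:
        raise ValueError("need eta1, eta0 >= 0 and eta1 + eta0 <= 1")
    rng = rng or random.Random()
    s = []
    for label in z0:
        u = rng.random()          # uniform on [0, 1)
        if u < eta1:
            s.append(label)       # correct label
        elif u < eta1 + eta0:
            s.append(-label)      # wrong label
        else:
            s.append(0)           # node left unlabelled
    return s
```

With `eta1 = 1` the oracle is perfect and reveals every label; with `eta0 > 0` a fraction of the revealed labels is flipped, which is the noisy regime studied in this section.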


5.4.1 MAP Estimator for DC-SBM with a Noisy Oracle


Proposition 5.1 (Adapted from Avrachenkov and Dreveton, 2020). Let $A$ be the
adjacency matrix of a homogeneous Poisson SBM as in (2.7), with $\alpha_{in} > \alpha_{out}$, and
let $s$ be the oracle information defined by (5.11). The Maximum A Posteriori (MAP)
estimator of the true class labelling is given by

\[
\hat{z}^{MAP} = \arg\max_{z \in [K]^n} P(z \mid A, s),
\]

which is equal to

\[
\arg\min_{z \in [K]^n} \; \mathrm{Cut}(A, z) - \tau\, n_1(z)\, n_2(z) + \lambda\, |\{ i \in \ell : z_i \neq s_i \}|, \tag{5.12}
\]
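where $\tau = \frac{\alpha_{in} - \alpha_{out}}{\log \alpha_{in} - \log \alpha_{out}}$, $\lambda = \frac{\log (\eta_1 / \eta_0)}{\log (\alpha_{in} / \alpha_{out})}$, and $n_k(z) = \sum_i 1(z_i = k)$ is the number of
nodes assigned to class $k$ by labelling $z$.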


The term $n_1(z) n_2(z) = n_1(z)(n - n_1(z))$ is maximal when $n_1(z) = n/2$, i.e.,
when $z$ predicts two clusters of the same size. Therefore, the MAP estimator in the SSL
context shows a trade-off between two unsupervised terms (minimising the graph's
cut and having balanced community sizes) and a semi-supervised term (minimising
the number of disagreements between the oracle and the prediction).
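
To make the three terms of the objective (5.12) concrete, here is a small sketch that evaluates it for a two-class labelling. It assumes that $\mathrm{Cut}(A, z)$ is the total weight of the edges $\{i, j\}$, $i < j$, whose endpoints receive different labels; the function name `map_objective` is illustrative.

```python
def map_objective(A, z, s, tau, lam):
    """Evaluate Cut(A, z) - tau * n1 * n2 + lam * #{labelled i : z_i != s_i}.

    A : symmetric adjacency matrix (list of lists)
    z : candidate labelling with entries in {-1, +1}
    s : oracle vector with entries in {-1, 0, +1} (0 means unlabelled)
    Assumption: Cut(A, z) sums A[i][j] over pairs i < j with z_i != z_j.
    """
    n = len(z)
    cut = sum(A[i][j] for i in range(n) for j in range(i + 1, n) if z[i] != z[j])
    n1 = sum(1 for label in z if label == 1)
    n2 = n - n1
    disagreements = sum(1 for zi, si in zip(z, s) if si != 0 and zi != si)
    return cut - tau * n1 * n2 + lam * disagreements


# Path graph 0 - 1 - 2 - 3 with the oracle labelling nodes 0 and 3 as +1.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
s = [1, 0, 0, 1]
balanced = map_objective(A, [1, 1, -1, -1], s, tau=0.5, lam=2.0)
trivial = map_objective(A, [1, 1, 1, 1], s, tau=0.5, lam=2.0)
```

In this toy case the balanced split pays one cut edge and one disagreement with the oracle but is rewarded by the balance term, which illustrates the trade-off described above.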


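Proof. First, we have from Bayes' formula

\[
P(z \mid A, s) \propto P(A \mid s, z)\, P(z \mid s),
\]

where the proportionality hides a term $P(A \mid s)$ independent of $z$. We established
at the end of the proof of Proposition 4.4 that

\[
\log P(A \mid z) = \frac{1}{2} \log \frac{\alpha_{in}}{\alpha_{out}} \sum_{i} \sum_{j \neq i} \left( a_{ij} - \frac{\alpha_{in} - \alpha_{out}}{\log (\alpha_{in} / \alpha_{out})} \right) 1(z_i = z_j) + C,
\]

where $C$ is a constant independent of $z$. Finally, the oracle information, given by
the term $P(z \mid s)$, is equal to

\[
\begin{aligned}
P(z \mid s) &= \prod_{i=1}^{n} \frac{P(s_i \mid z_i)\, P(z_i)}{P(s_i)} \\
&= \left( \frac{\eta_1}{\eta_1 + \eta_0} \right)^{|\{ i \in \ell :\, z_i = s_i \}|} \left( \frac{\eta_0}{\eta_1 + \eta_0} \right)^{|\{ i \in \ell :\, z_i \neq s_i \}|} \left( \frac{1}{2} \right)^{n - |\ell|} \\
&= \left( \frac{\eta_0}{\eta_1} \right)^{|\{ i \in \ell :\, z_i \neq s_i \}|} \left( \frac{\eta_1}{\eta_1 + \eta_0} \right)^{|\ell|} \left( \frac{1}{2} \right)^{n - |\ell|},
\end{aligned}
\tag{5.13}
\]

where we used $|\{ i \in \ell : z_i = s_i \}| + |\{ i \in \ell : z_i \neq s_i \}| = |\ell|$ in the last line.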


5.4.2 Continuous Relaxation 


The MAP estimator derived in Proposition 5.1 can be rewritten as

\[
\hat{z}^{MAP} = \arg\min_{z \in \{-1, 1\}^n} \; -z^T \left( A - \tau 1_n 1_n^T \right) z + \lambda (s - Pz)^T (s - Pz),
\]

where $P$ is the diagonal matrix whose element $(i, i)$ is equal to 1 if $i \in \ell$, and is
equal to 0 otherwise. Indeed, we simply notice that

\[
|\{ i \in \ell : z_i \neq s_i \}| = \frac{1}{4} \sum_{i \in \ell} (s_i - z_i)^2 = \frac{1}{4} (s - Pz)^T (s - Pz).
\]

We then perform a continuous relaxation mirroring what is commonly done for
unsupervised spectral methods (Newman, 2013) and discussed in Section 4.4.2;
namely, we consider the following optimisation problem:

\[
\hat{x} = \arg\min_{\substack{x \in \mathbb{R}^n \\ \sum_i \kappa_i x_i^2 = \sum_i \kappa_i}} \left( -x^T A_\tau x + \lambda (s - Px)^T (s - Px) \right), \tag{5.14}
\]

where $A_\tau = A - \tau 1_n 1_n^T$ and $\kappa = (\kappa_1, \dots, \kappa_n)$ is a vector of positive entries.

For simplicity of the derivations, we choose to constrain $x$ to the hyper-sphere
$\|x\|^2 = n$ by letting $\kappa_i = 1$, but other choices would lead to a similar analysis.
In particular, in the numerical Section 5.4.4 we will compare this choice with a
degree-normalization approach ($\kappa_i = d_i$).

We further note that for the perfect oracle the corresponding relaxation is

\[
\hat{x} = \arg\min_{\substack{x \in \mathbb{R}^n \\ x_\ell = s_\ell \\ \|x\|^2 = n}} \left( -x^T A_\tau x \right). \tag{5.15}
\]



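Given the classification vector $\hat{x} \in \mathbb{R}^n$, node $i$ is classified into cluster $\hat{z}_i \in \{-1, 1\}$
such that

\[
\hat{z}_i =
\begin{cases}
1, & \text{if } \hat{x}_i > 0, \\
-1, & \text{otherwise.}
\end{cases}
\tag{5.16}
\]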


Let us solve the minimisation problem (5.14). By letting $\gamma \in \mathbb{R}$ be the Lagrange
multiplier associated with the constraint $\|x\|^2 = n$, the Lagrangian of the
optimisation problem (5.14) is then

\[
-x^T A_\tau x + \lambda (s - Px)^T (s - Px) - \gamma \left( x^T x - n \right).
\]

This leads to the constrained linear system

\[
\left( -A_\tau + \lambda P - \gamma I_n \right) x = \lambda s, \qquad x^T x = n, \tag{5.17}
\]

whose unknowns are $\gamma$ and $x$.

The exact optimal value of $\gamma$ can be found explicitly following Gander et al.,
1989. Firstly, we note that if $(\gamma_1, x_1)$ and $(\gamma_2, x_2)$ are solutions of the
system (5.17), then

\[
C(x_1) - C(x_2) = \frac{\gamma_1 - \gamma_2}{2} \|x_1 - x_2\|^2,
\]

where $C(x) = -x^T A_\tau x + \lambda (s - Px)^T (s - Px)$ is the cost function minimised
in (5.14). Hence, among the solution pairs $(\gamma, x)$ of the system (5.17), the solution
of the minimisation problem (5.14) is the vector $x$ associated with the smallest $\gamma$.
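
For the reader's convenience, the comparison of the costs of two solution pairs can be checked by a short computation. The sketch below is not taken from the source; it introduces the shorthand $M = -A_\tau + \lambda P$ and assumes that the system (5.17) reads $(M - \gamma I_n) x = \lambda s$ with $x^T x = n$, and that $s$ vanishes outside the labelled set, so that $P s = s$.

```latex
% Shorthand introduced here: M = -A_\tau + \lambda P (symmetric), so that
% C(x) = x^T M x - 2 \lambda s^T x + \lambda s^T s, using P s = s and P^2 = P.
\begin{align*}
C(x_k) &= \gamma_k n - \lambda s^T x_k + \lambda s^T s
  && \text{since } M x_k = \gamma_k x_k + \lambda s \text{ and } x_k^T x_k = n, \\
(\gamma_1 - \gamma_2)\, x_1^T x_2 &= \lambda s^T (x_1 - x_2)
  && \text{since } x_2^T M x_1 = x_1^T M x_2, \\
C(x_1) - C(x_2) &= (\gamma_1 - \gamma_2)\,(n - x_1^T x_2)
  = \frac{\gamma_1 - \gamma_2}{2}\, \|x_1 - x_2\|^2 ,
\end{align*}
% where the last equality uses \|x_1 - x_2\|^2 = 2n - 2 x_1^T x_2.
```

In particular, the cost is a non-decreasing function of the multiplier, which is why the smallest $\gamma$ is selected.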
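Secondly, the eigenvalue decomposition of $-A_\tau + \lambda P$ reads as

\[
-A_\tau + \lambda P = Q \Delta Q^T,
\]

where $\Delta = \mathrm{diag}(\delta_1, \dots, \delta_n)$ with $\delta_1 \le \dots \le \delta_n$ and $Q^T Q = I_n$. Therefore,
after the change of variables $u = Q^T x$ and $b = \lambda Q^T s$, the system (5.17) is
transformed to

\[
\left( \Delta - \gamma I_n \right) u = b, \qquad u^T u = n.
\]

Thus, the solution $\hat{x}$ of the optimisation problem (5.14) verifies

\[
\left( -A_\tau + \lambda P - \gamma_* I_n \right) \hat{x} = \lambda s, \tag{5.18}
\]

where $\gamma_*$ is the smallest solution of the explicit secular equation (Gander et al., 1989)

\[
\sum_{i=1}^{n} \left( \frac{b_i}{\delta_i - \gamma} \right)^2 - n = 0. \tag{5.19}
\]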


We summarise this in Algorithm 14. Note that, for the sake of generality, we let $\lambda$
and $\tau$ be hyper-parameters of the algorithm. If the model parameters are known,
we can use the expressions of $\lambda$ and $\tau$ derived in Proposition 5.1. The choice of $\lambda$
and $\tau$ is further discussed in Section 5.4.4.


Algorithm 14: Semi-supervised learning by a MAP relaxation.
Input: adjacency matrix $A$, oracle information $s$, parameters $\tau$ and $\lambda$.
Output: node labelling $\hat{z} = (\hat{z}_1, \dots, \hat{z}_n) \in [K]^n$.
Process:
• let $\gamma_*$ be the smallest solution of equation (5.19);
• compute $\hat{x}$ as the solution of equation (5.18);
• for $i = 1, \dots, n$, let $\hat{z}_i$ be defined using (5.16) on $\hat{x}_i$.


Return: $\hat{z}$.
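
The three steps of Algorithm 14 can be sketched in a few lines of NumPy for the standard constraint $\|x\|^2 = n$. This is an illustration under stated assumptions, not the authors' implementation: the secular equation (5.19) is taken to be $\sum_i (b_i / (\delta_i - \gamma))^2 = n$, its smallest root is located by plain bisection below the smallest eigenvalue $\delta_1$ (a choice made here for simplicity), and the code assumes $b_1 \neq 0$ and at least one labelled node. The name `map_relaxed_ssl` is hypothetical.

```python
import numpy as np


def map_relaxed_ssl(A, s, tau, lam, n_bisect=200):
    """Sketch of Algorithm 14 with the constraint ||x||^2 = n.

    A : (n, n) symmetric adjacency matrix, s : oracle vector in {-1, 0, 1}^n.
    Returns the labels z_hat in {-1, +1}^n, the relaxed solution x and gamma.
    Assumes b[0] != 0, so the secular function has a root below delta[0].
    """
    A = np.asarray(A, dtype=float)
    s = np.asarray(s, dtype=float)
    n = A.shape[0]
    P = np.diag((s != 0).astype(float))           # indicator of labelled nodes
    M = -(A - tau * np.ones((n, n))) + lam * P    # -A_tau + lambda * P
    delta, Q = np.linalg.eigh(M)                  # eigenvalues in ascending order
    b = lam * (Q.T @ s)

    def secular(gamma):
        return np.sum((b / (delta - gamma)) ** 2) - n

    # On (-inf, delta[0]) the secular function is increasing; these two
    # points bracket its unique root there (secular(lo) <= 0 <= secular(hi)).
    root_n = np.sqrt(n)
    lo = delta[0] - np.linalg.norm(b) / root_n
    hi = delta[0] - abs(b[0]) / root_n
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if secular(mid) > 0:
            hi = mid
        else:
            lo = mid
    gamma = lo
    x = Q @ (b / (delta - gamma))                 # solves the linear system (5.18)
    z_hat = np.where(x > 0, 1, -1)                # decision rule (5.16)
    return z_hat, x, gamma


# Toy example: two disjoint cliques of 6 nodes, one labelled node in each.
m = 6
A = np.zeros((2 * m, 2 * m))
A[:m, :m] = 1.0
A[m:, m:] = 1.0
np.fill_diagonal(A, 0.0)
s = np.zeros(2 * m)
s[0], s[m] = 1.0, -1.0
z_hat, x, gamma = map_relaxed_ssl(A, s, tau=0.5, lam=1.0)
```

A dense eigendecomposition costs $O(n^3)$ operations, so this sketch is only meant for small graphs; for large sparse graphs an iterative solver would be a more natural choice.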


5.4.3 Upper Bound on the Number of Misclassified Nodes 


In this section, we derive an upper bound on the number of unlabelled nodes
misclassified by Algorithm 14 on a DC-SBM. We then specialise the results to some
particular cases. We will assume that, given $(p_{in}, p_{out}, \theta, z)$, the graph adjacency
matrix $A = (a_{ij})$ is generated as

\[
a_{ij} \sim
\begin{cases}
\mathrm{Ber}(\theta_i \theta_j p_{in}), & \text{if } z_i = z_j, \\
\mathrm{Ber}(\theta_i \theta_j p_{out}), & \text{otherwise,}
\end{cases}
\tag{5.20}
\]

for $i < j$, and $a_{ii} = 0$. Furthermore, we suppose that $z_i \sim \mathrm{Uni}(\{-1, 1\})$, and that
the entries of $\theta$ are independent random variables satisfying $\theta_i \in [\theta_{\min}, \theta_{\max}]$ with
$E\theta_i = 1$, $\theta_{\min} > 0$, and $\theta_{\max}^2 \max(p_{in}, p_{out}) \le 1$.

For an estimator $\hat{z} \in \{-1, 1\}^n$ of $z$, the number of mis-clustered nodes is simply
the Hamming distance between the two sequences $\hat{z}$ and $z$, defined as

\[
d_{Ham}(\hat{z}, z) = \sum_{i=1}^{n} 1(\hat{z}_i \neq z_i),
\]

and the proportion of mis-clustered nodes is $d_{Ham}(\hat{z}, z) / n$. Note that, unlike in
unsupervised clustering, we do not take a minimum over the permutations of the



predicted labels since we should be able to learn the correct community labels from 
the informative oracle. 


Theorem 5.5 (Avrachenkov and Dreveton, 2020). Consider a DC-SBM with a noisy
oracle as defined in (5.20), (5.11). Let $\bar{d} = \frac{n}{2}(p_{in} + p_{out})$ and $\bar{\alpha} = \frac{n}{2}(p_{in} - p_{out})$.
Suppose that $\tau > p_{out}$ and let $\hat{z}$ be the output of Algorithm 14. Then, the proportion
of misclassified unlabelled nodes verifies


diam (Zw Zu) < C (= fo) (¢ + J 1 
n j Pin — Pout A (m1 + no) (m — no) d 


In the following, the mean-field graph refers to the weighted graph formed by
the expected adjacency matrix of a DC-SBM graph. Moreover, we assume without
loss of generality that the first $n/2$ nodes are in the first cluster and the last $n/2$ are in
the second cluster. Therefore, $EA = Z B Z^T$ with

\[
B = \begin{pmatrix} p_{in} & p_{out} \\ p_{out} & p_{in} \end{pmatrix}
\quad \text{and} \quad
Z = \begin{pmatrix} 1_{n/2} & 0_{n/2} \\ 0_{n/2} & 1_{n/2} \end{pmatrix}.
\]

In particular, the coefficients $\theta_i$ disappear because $E\theta_i = 1$. We
consider the setting where the diagonal elements of $EA$ are not zero. This amounts to
modifying the definition of the DC-SBM so that self-loops can occur with
probability $p_{in}$. Nonetheless, we could set the diagonal elements of $EA$ to zero and our
results would still hold at the expense of cumbersome expressions. Note that the
matrix $EA$ has two non-zero eigenvalues: $\bar{d} = n \frac{p_{in} + p_{out}}{2}$ and $\bar{\alpha} = n \frac{p_{in} - p_{out}}{2}$.


Proof of Theorem 5.5. We prove the statement in three steps. We first show that
the solution $\hat{x}$ of the constrained linear system (5.17) is concentrated around the
solution $x$ of the same system for the mean-field model. Then, we compute $x$ and
show that we can retrieve the correct cluster assignment from it. We finally conclude
with the derivation of the bound.

(i) Similarly to (Avrachenkov et al., 2018c) and (Avrachenkov and Dreveton,
2019), let us rewrite equation (5.18) as a perturbation of a system of linear equations
corresponding to the mean-field solution. We thus have


(EL + AL) (x+ Ax) = As, 


where L = —A, + AP = paln, Ax = X —xand AL := Ĉ — EČ. 
A perturbation of a system of linear equations (A + AA) (x + Ax) = 6 leads to 


the following sensitivity inequality (Horn and Johnson, 2012, Section 5.8): 


|| Ax a IAAI 
ll IAI 
where $\|\cdot\|$ is the operator norm associated to a vector norm $\|\cdot\|$ (we use the same
notation for simplicity) and $\kappa(A) := \|A^{-1}\| \cdot \|A\|$ is the condition number. In our
case, the above inequality can be rewritten as follows:


| , (65.21) 


lill 


|£ -x| < | 


employing the Euclidean vector norm and the spectral operator norm. The spectral 


study of E £ (see Corollary B.2 in Appendix B.1.1) gives: 


ey 


where t} is defined in Corollary B.2 in Appendix B.1.1 and 7, is the solution of 


1 1 


min {[A| fhe Sp(E £)} E =t} = Fr 


equation (5.19) for the mean-field model. Lemma B.3 in Appendix B.1.2 leads to 


ey 


The last ingredient needed is the concentration of the adjacency matrix around 


1 
APG 


< 


(5.22) 


its expectation. We have 


|E-BL| < Me 94) ill +IA-EAl < |y- 7 |+ I4- EAI. 


Proposition B.2 in Appendix B.1.3 shows that 


27 (& + 4)? ) 


J2./m F nom — no)a2A 


Moreover, when $\bar{d} = \Omega(\log n)$, we have $\|A - EA\| = O(\sqrt{\bar{d}})$ (Feige and Ofek,
2005). If $\bar{d} = o(\log n)$, the same result holds with a proper pre-processing on $A$,
and we refer the reader to (Le et al., 2017) for more details. To keep notations short,
we will omit this extra step in the proof. Using this concentration bound, we have


|ē- EÂ 


| < (c+ 27 (& + 4)? ) 7 
T V2/m F nol — no)a2d 
( , 7) +a) Vd 
< (C+ = 
V2) &@à /mi F 0 (Mm — No) 
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for some constant C’. Let C = C’ + A By combining the above with inequal- 
ity (5.22), the inequality (5.21) becomes 
IR-  cQ+a va vā 
Ixl 7 a? /m F no (m — no) 


(5.23)

(ii) Node $i$ in the mean-field model is correctly classified by decision rule (5.16)
if the sign of $x_i$ equals the sign of $z_i$. Corollary B.5 in Appendix B.2 shows that this
is indeed the case for the unlabelled nodes.

(iii) Finally, for an unlabelled node $i$ to be correctly classified, the node's value $\hat{x}_i$
should be close enough to its mean-field value $x_i$. In particular, part (ii) shows
that if $|\hat{x}_i - x_i|$ is smaller than some non-vanishing constant $\beta$, then an
unlabelled node $i$ will be correctly classified. An unlabelled node $i$ is said to be
$\beta$-bad if $|\hat{x}_i - x_i| > \beta$. We denote by $S_\beta$ the set of $\beta$-bad nodes. The nodes
that are not $\beta$-bad are correctly classified, and thus $d_{Ham}(\hat{z}_u, z_u) \le |S_\beta|$.
From $\|\hat{x} - x\|^2 \ge \sum_{i \in S_\beta} |\hat{x}_i - x_i|^2$, it follows that $\|\hat{x} - x\|^2 \ge \beta^2 |S_\beta|$. Thus,
using (5.23) and the norm constraint $\|x\|^2 = n$, we have


sl < a (i) e 


ıı — No ad 


for some constant $C$. We end the proof by noticing that $\frac{\bar{d}}{\bar{\alpha}} = \frac{p_{in} + p_{out}}{p_{in} - p_{out}}$.


Corollary 5.6 (Almost exact recovery in the diverging degree regime). Consider a
DC-SBM such that $\bar{d} \gg 1$, $\bar{d} / \bar{\alpha} = O(1)$, and $\sqrt{\eta_0 + \eta_1}\,(\eta_1 - \eta_0) \gg 1 / \sqrt{\bar{d}}$.
Suppose that $\tau > p_{out}$ and $\lambda \lesssim \bar{\alpha}$. Then, Algorithm 14 correctly classifies almost all
the unlabelled nodes.


Proof. With the corollary's assumptions, $(\eta_1 - \eta_0)^2 \bar{d} \to +\infty$ and $\lambda / \bar{\alpha} = O(1)$.
Thus, by Theorem 5.5 the fraction of misclassified nodes is $o(1)$.


The quantity $(\eta_1 - \eta_0)$ is, up to the factor $n$, the expected difference between the
number of nodes correctly labelled and the number of nodes wrongly labelled by the
oracle. In particular, Corollary 5.6 allows for a sub-linear number of labelled nodes
since $\eta_0$ and $\eta_1$ can go to zero.


Corollary 5.7 (Detection in the constant degree regime). Consider a DC-SBM
such that $p_{in} = \frac{c_{in}}{n}$ and $p_{out} = \frac{c_{out}}{n}$, where $c_{in}$, $c_{out}$ are constants. Suppose that
$\sqrt{\eta_0 + \eta_1}\,(\eta_1 - \eta_0)$ is a non-zero constant, and let $\tau > 2 p_{out}$ and $\lambda = 1$. Then,
for $\frac{(c_{in} - c_{out})^2}{c_{in} + c_{out}}$ bigger than some constant, w.h.p. Algorithm 14 performs better than a
random guess.
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Proof. According to Theorem 5.5, the fraction of misclustered nodes is smaller than 


(cin=6o y 2C ats 2 : : 
oe ae (444)", which is lower bounded by a 


4 when is larger than 


constant. 


The quantity $\frac{(c_{in} - c_{out})^2}{c_{in} + c_{out}}$ can be interpreted as the signal-to-noise ratio. It is
unfortunate that Corollary 5.7 does not allow us to control the constant in the statement
of the corollary. This constant comes from the concentration of the adjacency matrix.
Similar remarks were made in (Le et al., 2017) for the analysis of unsupervised
spectral clustering in the constant degree regime for SBM graphs.


5.4.4 Numerical Results 


This section presents numerical experiments both on synthetic data sets generated
from DC-SBMs and on real networks. In particular, we discuss the impact of the
oracle mistakes (defined by the ratio $\frac{\eta_0}{\eta_1 + \eta_0}$) on the performance of the algorithms.


Choice of 2 and t Let us denote by oj and o2 the largest and second largest 


log 5 
T ing 12D if yo Æ 0, and å = 


O89) 


eigenvalues of A. We choose t = 406, +02) and À = 


Ssh) otherwise. The heuristic for this choice is as follows. For a SBM graph, we 


01—02 
have o] © 5 (Pin + Pout) and o2 © 5 (Pin —Pout)> hence 4 (01 +02) = = 2Pin > Pout 
log 7+ log + 
eita? d 
lo 01—02 Pout 


which is indeed close to the expression of À derived in Proposition 5.1 if Pins Pour = 


o(1). 


Choice of relaxation. We first compare the choice of the constraint in the
continuous relaxation (5.14). Specifically, we compare the choice $\sum_i x_i^2 = n$ (referred


log 


and t verifies the condition of Theorem 5.5. For 2, we have 


to as standard relaxation) versus $\sum_i d_i x_i^2 = 2|E|$ (referred to as degree-normalized
relaxation). This leads to two versions of Algorithm 14, whose cost obtained on
SBM graphs with a noisy oracle is presented in Figure 5.5. In particular, we observe
that the normalized choice leads to a smaller cost. Therefore, in the following we
will only consider the version of Algorithm 14 that solves the relaxed problem (5.14)
with constraint $\sum_i d_i x_i^2 = 2|E|$ instead of $\sum_i x_i^2 = n$, as it gives better numerical
results.


Figure 5.5. Cost in Algorithm 14 with the standard and degree-normalized versions of
the constraint, on 50 realizations of SBM with $n = 500$, $p_{out} = 0.03$ and 50 labelled nodes
with 10% noise.


Experiments on synthetic graphs. We first consider clustering on DC-SBM.
We set $n = 2000$, $p_{in} = 0.04$ and $p_{out} = 0.02$. We consider three scenarios:

• In Figure 5.6(a) we consider a standard SBM ($\theta_i = 1$ for all $i$).

• In Figure 5.6(b) we generate $\theta_i$ according to $|N(0, \sigma^2)| + 1 - \sigma\sqrt{2/\pi}$, where
$|N(0, \sigma^2)|$ denotes the absolute value of a normal random variable with mean
0 and variance $\sigma^2$. We take $\sigma = 0.25$. Note that this definition enforces
$E\theta_i = 1$.

• In Figure 5.6(c) we generate $\theta_i$ from a Pareto distribution with density function
$f(x) = \frac{a m^a}{x^{a+1}} 1(x \ge m)$ with $a = 3$ and $m = 2/3$ (chosen such that $E\theta_i = 1$).


Figure 5.6. Average accuracy obtained by different semi-supervised clustering methods
on DC-SBM graphs, with $n = 1000$, $p_{in} = 0.04$ and $p_{out} = 0.02$ with different distributions for $\theta$.
The number of labelled nodes is equal to 40. Accuracies are computed on the unlabelled
nodes and are averaged over 50 realisations; the error bars show the standard error.
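
The degree-correction parameters of scenarios (b) and (c) can be sampled with the standard library alone. The sketch below assumes the half-normal shift $1 - \sigma\sqrt{2/\pi}$ and the Pareto parameters $a = 3$, $m = 2/3$ given above; the function names are illustrative.

```python
import math
import random


def sample_theta_half_normal(n, sigma, rng):
    """theta_i = |N(0, sigma^2)| + 1 - sigma * sqrt(2 / pi), which has mean 1."""
    shift = 1.0 - sigma * math.sqrt(2.0 / math.pi)
    return [abs(rng.gauss(0.0, sigma)) + shift for _ in range(n)]


def sample_theta_pareto(n, a, m, rng):
    """theta_i = m * X with X a standard Pareto variable of shape a.

    The mean is a * m / (a - 1), which equals 1 for a = 3 and m = 2 / 3.
    """
    return [m * rng.paretovariate(a) for _ in range(n)]


rng = random.Random(0)
theta_b = sample_theta_half_normal(100_000, 0.25, rng)
theta_c = sample_theta_pareto(100_000, 3.0, 2.0 / 3.0, rng)
mean_b = sum(theta_b) / len(theta_b)
mean_c = sum(theta_c) / len(theta_c)
```

Both empirical means should be close to 1; the Pareto sample converges more slowly because of its heavy tail.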


We compare the performance of Algorithm 14 (called map-relaxed in the
figures) with Poisson learning (Algorithm 10) and constrained spectral clustering
(Algorithm 11, abbreviated as csc). Results are shown in Figure 5.6. While map-relaxed
and csc limit the decrease of accuracy when the noise increases, the performance of
csc is quite poor on those synthetic data sets. Furthermore, we notice that Poisson
learning also gives poor results on the synthetic data sets, and its performance further
deteriorates with noise.


Experiments on MNIST data set. As a real-life example, we perform
simulations on the standard MNIST data set (LeCun et al., 1998). As preprocessing,
we select 1000 images corresponding to two digits and compute the
$k$-nearest-neighbors graph (we take $k = 8$) with Gaussian weights $w_{ij} = \exp(-\|x_i - x_j\|^2 / s_i^2)$,
where $x_i$ represents the data for image $i$ and $s_i$ is the average distance between $x_i$ and
its $k$-nearest neighbors. Accuracy for different digit pairs is given in Figure 5.7. We
notice that the performance of the three algorithms is excellent. However, under large
oracle noise, the accuracy of Poisson learning decreases more than the accuracy of
Algorithm 14 or constrained spectral clustering.


Figure 5.7. Average accuracy obtained on a subset of the MNIST data set by different
semi-supervised algorithms as a function of the oracle-misclassification ratio, when the
number of labelled nodes is equal to 10. Accuracy is averaged over 50 random realizations,
and the error bars show the standard error.


Figure 5.8. Average accuracy obtained on the unlabelled, correctly labelled, and wrongly
labelled nodes by the oracle. Simulations are done on 1000 digits (2,4). The noisy oracle
correctly classifies 24 nodes and misclassifies 16 nodes, and the boxplots show 100
realizations.
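
The graph construction used for the MNIST experiment can be sketched as follows. The code assumes Euclidean distances and distinct data points, and it returns one weight per ordered pair $(i, j)$; how the weights are symmetrised is not stated in the text, so that step is left to the caller. The name `knn_gaussian_weights` is hypothetical.

```python
import math


def knn_gaussian_weights(points, k):
    """Directed k-nearest-neighbour graph with weights exp(-d(i, j)^2 / s_i^2).

    points : list of equal-length coordinate tuples (assumed distinct)
    Returns {(i, j): w_ij} for j among the k nearest neighbours of i, where
    s_i is the average distance from i to those k neighbours.
    Assumption: symmetrisation is not specified in the text and is omitted.
    """
    n = len(points)
    weights = {}
    for i in range(n):
        # Distances from i to every other point, closest first.
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        nearest = dists[:k]
        s_i = sum(d for d, _ in nearest) / k
        for d, j in nearest:
            weights[(i, j)] = math.exp(-(d * d) / (s_i * s_i))
    return weights


# Four points on a line, one neighbour each.
w = knn_gaussian_weights([(0.0,), (1.0,), (2.0,), (10.0,)], k=1)
```

The local scale $s_i$ adapts the kernel width to the density around each point, which is why the isolated point at 10 still receives a non-negligible weight towards its nearest neighbour.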

To further highlight the influence of the noise, we plot in Figure 5.8 the accuracy 
obtained by the three algorithms on the unlabelled nodes, the correctly labelled 
nodes, and the wrongly labelled nodes. While the accuracy of Poisson learning 
is excellent on the unlabelled nodes, it fails at correctly classifying the wrongly 
labelled nodes. On the contrary, Algorithm 14 allows for a smoother recovery: the
unlabelled, correctly labelled, and wrongly labelled nodes have roughly the same
classification accuracy. While some correctly labelled nodes are misclassified, many
wrongly labelled nodes become correctly classified, and the unlabelled nodes are 
better recovered. 


Further Notes 


In many networks, such as social networks, citation networks and knowledge 
graphs, the nodes have features. Thus, it is very natural to try to take into account 
both the graph structure and the features. This idea has been implemented in Graph 
Neural Networks (GNNs). Scarselli et al., 2008 were probably the first to propose 
a framework for the design of GNNs. Then, Defferrard et al., 2016 elaborated 
an efficient implementation of GNN using graph Fourier transform, and Kipf 
and Welling, 2017 have developed GNN in the semi-supervised learning context. 
Several works made nice connections between Personalized PageRank and GNNs: 
Klicpera et al., 2019, Bojchevski et al., 2020, Chien et al., 2020. Recently, many 
works have been published on this topic and an interested reader can find compre- 
hensive reviews in (Wu et al., 2020; Zhou et al., 2020). 

In addition to the analysis presented in Section 5.4, the methods of random 
matrix theory have also been applied to semi-supervised learning in (Mai and Couil- 
let, 2018, 2021). 

With the advance of high-performance computing and cloud computing, one
needs to consider parallel computation approaches to graph-based semi-supervised
learning. A few examples of such approaches are presented in (Avrachenkov et al.,
2016a; Ravi and Diao, 2016; Chen et al., 2020).


DOI: 10.1561/9781638280514.ch6


Community Detection in Temporal 
Networks 


Previous chapters focus on the study of static interactions, represented by a binary 
number or a positive weight. Nevertheless, in many application domains, interac- 
tions vary over time. The longitudinal nature of such datasets calls for replacing 
classical graph-based models with temporal network models represented by tensors 
(Holme and Saramäki, 2012; Kivelä et al., 2014). We note that handling the
temporal aspects carelessly, e.g., by aggregating or smoothing the temporal
data along the time axis, can lead to a loss of valuable information.

The problem of community detection in temporal networks has recently 
attracted a considerable amount of attention from the scientific community. While 
it led to many interesting results, it also led to an explosion of disparate terminolo- 
gies and algorithms. 

In this chapter, we will first unify existing models of temporal networks with 
communities into a single framework. Then, we will show that the existing works 
can be grouped into two large categories: models with fixed community member- 
ships and models with time-varying community memberships. We will then study 
each of these cases separately.


6.1 A General Model of Temporal Networks with
Communities


6.1.1 Membership and Interaction Structures 


We consider a block model for temporal networks with $n$ nodes, $K$ blocks and $T$
temporal snapshots. The observed data consists of a list of $T$ adjacency matrices
$(A^1, \dots, A^T)$, where each matrix $A^t \in \{0, 1\}^{n \times n}$ describes a snapshot of the
network at a particular time instant. Furthermore, at time $t$ the node set is partitioned
into $K$ latent communities, and we denote by $Z_{it}$ the label of node $i$ at time $t$.

The matrix $Z \in [K]^{n \times T}$ represents the membership structure. Each column $Z_{\cdot t} \in
[K]^n$ consists of the community labels of the nodes at a given time $t$, while the row
$Z_{i \cdot} \in [K]^T$ represents the membership pattern of node $i$ (i.e., the evolution of the
community label for node $i$).

We assume that the node membership patterns are independent and distributed
according to a probability distribution $p$ over $[K]^T$. Therefore,

\[
P(Z) = \prod_{i=1}^{n} p(Z_{i \cdot}). \tag{6.1}
\]


Conditionally on the block membership structure $Z$, we want to generate a
random tensor $A = (A^t_{ij}) \in \{0, 1\}^{n \times n \times T}$, indexed by node pairs $\{i, j\}$ and time
$t$, and verifying $A^t_{ij} = A^t_{ji}$, such that the interaction patterns between node pairs
are independent. We denote by $B_{k^{1:T} \ell^{1:T}}(x^{1:T})$ the probability of observing an
interaction pattern $x^{1:T} \in \{0, 1\}^T$ between a pair of nodes with block patterns
$k^{1:T} = (k_1, \dots, k_T) \in [K]^T$ and $\ell^{1:T} = (\ell_1, \dots, \ell_T) \in [K]^T$. The interaction
structure $B$ is thus a collection of probability measures $B = (B_{k^{1:T} \ell^{1:T}})$.
Conditionally on $B$ and $Z$, the law of the random tensor $A$ is defined as

\[
P(A \mid Z, B) = \prod_{1 \le i < j \le n} B_{Z_{i \cdot} Z_{j \cdot}}\left( A^{1:T}_{ij} \right). \tag{6.2}
\]

The model (6.1)-(6.2) is the most general expression of a block model with
$n$ nodes and $K$ clusters for a temporal network with $T$ snapshots. Since the size
of the block membership structure is $K \times T$, there are $\binom{K^T}{2}$ choices of probability
measures $B_{k^{1:T} \ell^{1:T}}$. Keeping this full generality leads to an over-complicated model.
The following section details some particular cases of interest.


6.1.2 Examples of Temporal Network Models 


Static memberships, dynamic interactions. The block membership
structure $Z$ is static if the columns of the matrix $Z$ are equal. Equivalently, the
community labelling of each node does not vary over time. In that case, we can
simply denote the static community labelling by a vector $z \in [K]^n$. Furthermore,
the block interaction structure $B$ reduces to an interaction kernel $f = (f_{k\ell})_{k, \ell \in [K]}$,
which is a collection of probability distributions on $S = \{0, 1\}^T$ such that $f_{k\ell} = f_{\ell k}$.
This defines the probability distribution

\[
P(A \mid z) = \prod_{1 \le i < j \le n} f_{z_i z_j}(A_{ij}) \tag{6.3}
\]

of a symmetric interaction tensor $A \in S^{n \times n}$ representing a temporal block model
with static community structure. The model is homogeneous if the interaction kernel
takes the form

\[
f_{k\ell} =
\begin{cases}
f_{in}, & \text{if } k = \ell, \\
f_{out}, & \text{otherwise.}
\end{cases}
\]

Here $f_{in}$ represents the distribution of the interactions within a block, while the
interactions across blocks are distributed according to $f_{out}$.


Example 6.1. Let $x = (x_1, \dots, x_T) \in \{0, 1\}^T$. A temporal SBM with static
membership structure has temporally independent interactions if, for all $k, \ell \in [K]$, we
have $f_{k\ell}(x) = \prod_{t=1}^{T} \mu_{k\ell}(x_t)$, where $\mu_{k\ell}$ is a probability distribution over $\{0, 1\}$. A
temporal block model with static membership structure and temporally
independent interactions corresponds to $T$ independent observations of a binary SBM.


Example 6.2. Let $x = (x_1, \dots, x_T) \in \{0, 1\}^T$. A temporal SBM with static
memberships has Markov interactions if, for all $k, \ell \in [K]$, we have $f_{k\ell}(x) =
\mu_{k\ell}(x_1) \prod_{t=2}^{T} P_{k\ell}(x_{t-1}, x_t)$, where $\mu_{k\ell}$ is a probability distribution over $\{0, 1\}$
and $P_{k\ell}$ is a 2-by-2 stochastic matrix representing the probability of transitions
between two consecutive snapshots. If $P_{k\ell}(a, b) = \mu_{k\ell}(b)$ for all $k, \ell \in [K]$ and
$a, b \in \{0, 1\}$, then we recover the model described in Example 6.1.
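
The Markov interaction kernel of Example 6.2 is straightforward to evaluate for a single pair of blocks. The sketch below, with the illustrative name `markov_kernel`, checks two properties stated above: the kernel is a probability distribution on $\{0, 1\}^T$, and it reduces to the independent kernel of Example 6.1 when both rows of the transition matrix equal the initial distribution.

```python
from itertools import product


def markov_kernel(x, mu, P):
    """Probability mu(x_1) * prod_{t >= 2} P(x_{t-1}, x_t) of a binary pattern x.

    mu : initial distribution [mu(0), mu(1)]
    P  : 2-by-2 stochastic matrix, P[a][b] = probability of a transition a -> b
    """
    prob = mu[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= P[prev][cur]
    return prob


mu = [0.8, 0.2]
P = [[0.9, 0.1], [0.3, 0.7]]

# The kernel sums to one over all 2^T patterns (here T = 4).
total = sum(markov_kernel(x, mu, P) for x in product((0, 1), repeat=4))

# With P(a, b) = mu(b), consecutive snapshots are independent (Example 6.1).
independent = markov_kernel((0, 1, 0), mu, [mu, mu])
```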


Temporally independent interactions. The block interaction structure is
temporally independent if at any time step $t$ the binary interaction between nodes $i$
and $j$ is re-sampled according to the community labelling of $i$ and $j$ at time $t$. The
law of the random tensor $A$ is then given by

\[
P(A \mid Z, B) = \prod_{1 \le i < j \le n} \prod_{t=1}^{T} Q_{Z_{it} Z_{jt}}\left( A^t_{ij} \right), \tag{6.4}
\]

where $Q = (Q_{k\ell})_{k, \ell \in [K]}$ is a set of distributions on $\{0, 1\}$.


Markov membership structure. Let $z^{1:T} = (z_1, \dots, z_T) \in [K]^T$ denote a
membership pattern and recall that $p$ is the distribution of the nodes' community
assignment (see equation (6.1)). The model has a Markov membership structure if

\[
p(z^{1:T}) = \alpha_{z_1} \prod_{t=2}^{T} \pi_{z_{t-1}, z_t},
\]

where $\alpha$ is the initial probability distribution on $[K]$ and $\pi$ is a $K$-by-$K$ matrix of
transition probabilities.

Example 6.3. Let $1_K = (1, \dots, 1)^T$ be the $K \times 1$ vector of all ones. The model with
a Markov membership structure defined by $\alpha = \frac{1}{K} 1_K$ and $\pi = r I_K + \frac{1 - r}{K} 1_K 1_K^T$
corresponds to a model where:

• at initial time $t = 1$, the community labels of all nodes are chosen
independently and uniformly at random;

• at time $t \ge 2$, with probability $r$ a given node $i$ remains in the community
it was in at time $t - 1$, while with probability $1 - r$ the node is assigned to a
new community chosen uniformly at random.
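
The two bullet points above translate directly into a sampler. The following sketch assumes the transition matrix $\pi = r I_K + \frac{1-r}{K} 1_K 1_K^T$ and uses labels $0, \dots, K-1$ instead of $1, \dots, K$; the function names are illustrative.

```python
import random


def transition_matrix(K, r):
    """pi = r * I_K + (1 - r) / K * 1_K 1_K^T, as a list of rows."""
    return [[(r if k == l else 0.0) + (1.0 - r) / K for l in range(K)] for k in range(K)]


def sample_membership_pattern(K, T, r, rng):
    """Community labels (in {0, ..., K-1}) of one node over T snapshots."""
    pattern = [rng.randrange(K)]                 # uniform initial label
    for _ in range(T - 1):
        if rng.random() < r:
            pattern.append(pattern[-1])          # the node keeps its community
        else:
            pattern.append(rng.randrange(K))     # fresh uniform draw
    return pattern


pi = transition_matrix(3, 0.7)
static_pattern = sample_membership_pattern(3, 10, 1.0, random.Random(1))
moving_pattern = sample_membership_pattern(3, 10, 0.5, random.Random(1))
```

Note that a "new" community drawn uniformly may coincide with the previous one, which is why the diagonal of $\pi$ equals $r + (1 - r)/K$ rather than $r$.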


6.2 Networks with Static Community Memberships 


6.2.1 Recovery Thresholds in SBM with Markov Interaction 


The model is called a Markov Stochastic Block Model if the community memberships
are static and the temporal interactions are Markovian (see also Example 6.2). In a
homogeneous Markov SBM, the interaction kernels take the form

\[
\begin{aligned}
f_{in}(x) &= \mu_{x_1} P_{x_1 x_2} \cdots P_{x_{T-1} x_T}, \\
f_{out}(x) &= \nu_{x_1} Q_{x_1 x_2} \cdots Q_{x_{T-1} x_T},
\end{aligned}
\tag{6.5}
\]

where $\mu$, $\nu$ are the initial probability distributions on $\{0, 1\}$ and $P$, $Q$ are the
transition probability matrices on $\{0, 1\}$.

In the sparse regime, the probability of observing a non-zero interaction between
any particular pair of nodes is small, i.e.,

\[
\max\{ \mu_1, \nu_1, P_{01}, Q_{01} \} \lesssim \rho. \tag{6.6}
\]

One particular case is to assume that for some constants $u, v, p_{01}, q_{01} \in (0, \infty)$, we
have

\[
\mu_1 = u \rho, \quad \nu_1 = v \rho, \quad P_{01} = p_{01} \rho, \quad Q_{01} = q_{01} \rho. \tag{6.7}
\]

Under this assumption, the expected number of 1's in an $f_{in}$-distributed signal is
$E \sum_t X_t \lesssim \mu_1 + (T - 1) P_{01} = O(\rho T)$. Hence, when $\rho T = o(1)$, the
probability of observing an interaction in any particular node pair is small.


The following proposition, whose proof can be found in (Avrachenkov et al.,
2022), states recovery conditions for a sparse Markov SBM when $n \gg 1$ and
$T \gg 1$. The notions of consistent and strongly consistent estimators were defined
in Section 4.4.3.


Proposition 6.1. Consider a homogeneous Markov SBM composed of $n \gg 1$ nodes,
$K \asymp 1$ blocks, $T \gg 1$ snapshots, where $f_{in}$ and $f_{out}$ are Markov chain distributions
defined by (6.5) and satisfying (6.7) with a sparsity parameter $\rho$ such that $\rho T \ll 1$.
Suppose that $P_{11}$ and $Q_{11}$ are constants such that $(P_{11}, Q_{11}) \neq (1, 1)$, and that
$(p_{01}, P_{11}) \neq (q_{01}, Q_{11})$. Let

\[
I = \left( \sqrt{p_{01}} - \sqrt{q_{01}} \right)^2 + 2 \sqrt{p_{01} q_{01}}\, H_{11},
\]

where $H_{11} = 1 - \frac{\sqrt{(1 - P_{11})(1 - Q_{11})}}{1 - \sqrt{P_{11} Q_{11}}}$ is the squared Hellinger divergence between two
geometric distributions with parameters $P_{11}$ and $Q_{11}$. Then:

(i) a consistent estimator does not exist for $\rho T \lesssim \frac{1}{n}$ and does exist for $\rho T \gg \frac{1}{n}$;

(ii) a strongly consistent estimator does not exist for $\rho T \ll \frac{\log n}{n}$ and does exist for
$\rho T \gg \frac{\log n}{n}$;

(iii) in a critical regime with $\rho T = (1 + o(1)) \tau \frac{\log n}{n}$ for some constant $\tau$, a strongly
consistent estimator does not exist for $\tau I < K$ and does exist for $\tau I > K$.


The quantity $\rho T I$ corresponds to the main term in the Taylor expansion of the
Rényi divergence between the two Markov chain distributions $f_{in}$ and $f_{out}$. We refer to
(Avrachenkov et al., 2022) for more details and proofs.
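
As a numerical illustration, the quantity $I$ of Proposition 6.1 can be evaluated directly. The sketch below assumes the reading $I = (\sqrt{p_{01}} - \sqrt{q_{01}})^2 + 2\sqrt{p_{01} q_{01}}\, H_{11}$ with $H_{11} = 1 - \sqrt{(1 - P_{11})(1 - Q_{11})} / (1 - \sqrt{P_{11} Q_{11}})$, and requires $(P_{11}, Q_{11}) \neq (1, 1)$ as in the proposition; the function names are illustrative.

```python
import math


def hellinger_geometric(P11, Q11):
    """Squared Hellinger divergence between geometric laws with parameters P11, Q11.

    Assumes (P11, Q11) != (1, 1) so that the denominator does not vanish.
    """
    return 1.0 - math.sqrt((1.0 - P11) * (1.0 - Q11)) / (1.0 - math.sqrt(P11 * Q11))


def information_quantity(p01, q01, P11, Q11):
    """I = (sqrt(p01) - sqrt(q01))^2 + 2 * sqrt(p01 * q01) * H11."""
    H11 = hellinger_geometric(P11, Q11)
    return (math.sqrt(p01) - math.sqrt(q01)) ** 2 + 2.0 * math.sqrt(p01 * q01) * H11


# Same persistence inside and across blocks: only the birth rates matter.
same_persistence = information_quantity(4.0, 1.0, 0.5, 0.5)
# Same birth rates: the signal comes only from the persistence of interactions.
same_birth = information_quantity(2.0, 2.0, 0.9, 0.1)
```

The second example shows that communities can be distinguishable even when interactions appear at the same rate inside and across blocks, provided that they persist differently over time.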


Remark 6.1. We recall from Example 4.1 that consistent recovery in binary SBM
requires $\rho \gg n^{-1}$. In particular, Proposition 6.1 shows that consistent recovery is
possible even in a very sparse regime, as long as the number of snapshots is large
enough. For example, if $\rho = \frac{1}{n}$, then $T$ has to be at least of the order $\omega(1)$ for
consistent recovery to be possible.


Remark 6.2. The regime with signal strength $\rho T = (1 + o(1)) \tau \frac{\log n}{n}$ is an
interesting critical regime. Indeed, in this regime, the phase transition for strong
consistency occurs at $\tau \left( (\sqrt{p_{01}} - \sqrt{q_{01}})^2 + 2 \sqrt{p_{01} q_{01}}\, H_{11} \right) > K$. By comparison, the
interesting regime for strong consistency in static SBM is $\rho = (1 + o(1)) \frac{\log n}{n}$ (see
Example 4.2).


6.2.2 Online Likelihood-based Algorithms for Markov
Dynamics


In this section, we derive an algorithm for clustering temporal networks with static 
community memberships. We consider situations in which data arrives snapshot 
per snapshot and an online estimate of the community memberships is updated at 
each time step. 


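Model parameters are known. Given $A^{1:t} = (A^1, \dots, A^t)$, we define a
log-likelihood ratio matrix by

\[
M^t_{ij} = \log \frac{f_{in}\left( A^{1:t}_{ij} \right)}{f_{out}\left( A^{1:t}_{ij} \right)}, \tag{6.8}
\]

where $f_{in}$ and $f_{out}$ are the intra- and inter-block interaction probabilities. In
particular, the log of the probability of observing a graph sequence $A^{1:t}$ given node
labelling $z$ equals

\[
\log P\left( A^{1:t} \mid z \right) = \frac{1}{2} \sum_{i} \sum_{j \neq i} M^t_{ij}\, 1(z_i = z_j) + \frac{1}{2} \sum_{i} \sum_{j \neq i} \log f_{out}\left( A^{1:t}_{ij} \right).
\]

Therefore, given an assignment $\hat{z}^{t-1}$ computed from the observation of the $t - 1$
first snapshots, one can compute a new assignment $\hat{z}^{t}$ such that node $i$ is assigned
to any block $k$ which maximises

\[
L^t_{ik} = \sum_{j \neq i} M^t_{ij}\, 1\left( \hat{z}^{t-1}_j = k \right). \tag{6.9}
\]

This formula is interesting only if the computation of $M^t$ can be easily done
from $M^{t-1}$. This is in particular the case for the Markovian evolution. Indeed,
if $f_{in}$ and $f_{out}$ are given by (6.5), then the cumulative log-likelihood matrices
defined in equation (6.8) can be computed recursively by $M^t = M^{t-1} + \Delta^t$,
where

\[
M^1_{ij} = \log \frac{\mu\left( A^1_{ij} \right)}{\nu\left( A^1_{ij} \right)} \quad \text{and} \quad \Delta^t_{ij} = \log \frac{P\left( A^{t-1}_{ij}, A^t_{ij} \right)}{Q\left( A^{t-1}_{ij}, A^t_{ij} \right)}.
\]

We summarise this in Algorithm 15. Let us emphasise that this algorithm works
in an online adaptive fashion.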

The time complexity (worst case complexity) of Algorithm 15 is $O(K n^2 T)$
plus the time complexity of the initial clustering. The space complexity is $O(n^2)$.


Algorithm 15: Online clustering for homogeneous Markov dynamics when
the block interaction parameters are known.
Input: interaction tensor $(A^t_{ij})$; block interaction parameters $\mu$, $\nu$, $P$, $Q$;
number of communities $K$; static graph clustering algorithm algo.
Output: node labelling $\hat{z} = (\hat{z}_1, \dots, \hat{z}_n) \in [K]^n$.
Initialize: compute $\hat{z} \leftarrow \mathrm{algo}(A^1)$, and $M_{ij} \leftarrow \log \frac{\mu(A^1_{ij})}{\nu(A^1_{ij})}$ for $i, j = 1, \dots, n$.
for $t = 2, \dots, T$ do
    compute $\Delta_{ij} \leftarrow \log \frac{P(A^{t-1}_{ij}, A^t_{ij})}{Q(A^{t-1}_{ij}, A^t_{ij})}$ for $i, j = 1, \dots, n$;
    update $M \leftarrow M + \Delta$;
    for $i = 1, \dots, n$ do
        set $L_{ik} \leftarrow \sum_{j \neq i} M_{ij}\, 1(\hat{z}_j = k)$ for $k = 1, \dots, K$;
        set $\hat{z}_i \leftarrow \arg\max_{1 \le k \le K} L_{ik}$.

Return: $\hat{z}$.


In addition, we note that:

• since at each time step $\Delta_{ij}$ can take only one of four values, these four values can be precomputed and stored to avoid computing $n^2T$ logarithms;

• the $n$-by-$K$ matrix $(L_{ik})$ can be computed as a matrix product $L = M^0\hat Z$, where $M^0$ is the matrix obtained by zeroing out the diagonal of $M$, and $\hat Z$ is the one-hot representation of $\hat z$ such that $\hat Z_{ik} = 1$ if $\hat z_i = k$, and zero otherwise;

• for sparse networks the (average) time and space complexity can be reduced by a factor of $d/n$, where $d$ is the average node degree in a single snapshot, by neglecting the $0 \to 0$ transitions and only storing nonzero entries.
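The recursion and the remarks above can be sketched in Python with NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `online_markov_clustering`, the `(T, n, n)` array layout and the zero-based labels are choices made here, the initial labelling is passed in rather than computed by a static algorithm, and all nodes are relabelled simultaneously through the product $L = M^0\hat Z$.

```python
import numpy as np

def online_markov_clustering(A, z_init, mu, nu, P, Q, K):
    """Sketch of the known-parameter online update (hypothetical helper).

    A      : (T, n, n) array of 0/1 snapshots
    z_init : initial labels in {0, ..., K-1}
    mu, nu : length-2 initial distributions (intra / inter block)
    P, Q   : 2x2 transition matrices (intra / inter block)
    """
    A = np.asarray(A, dtype=int)
    T = A.shape[0]
    log_init = np.log(np.asarray(mu) / np.asarray(nu))    # 2 precomputed values
    log_trans = np.log(np.asarray(P) / np.asarray(Q))     # 4 precomputed values
    M = log_init[A[0]].astype(float)                      # M^1
    z = np.array(z_init, dtype=int)
    for t in range(1, T):
        M += log_trans[A[t - 1], A[t]]                    # M^t = M^{t-1} + Delta^t
        M0 = M.copy()
        np.fill_diagonal(M0, 0.0)                         # drop the j = i terms
        Z = np.eye(K)[z]                                  # one-hot labels, shape (n, K)
        L = M0 @ Z                                        # L[i, k] = sum_{j != i} M_ij 1(z_j = k)
        z = L.argmax(axis=1)
    return z
```

Looking up the log-ratios by integer indexing (`log_trans[A[t - 1], A[t]]`) is what avoids recomputing logarithms at every step.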


Extension when the parameters are unknown Algorithm 15 requires the 
a priori knowledge of the interaction parameters. This is often not the case in prac- 
tice and one has to learn the parameters during the process of recovering communi- 
ties. In this section, we adapt Algorithm 15 by estimating the parameters on the fly. 

Let $n_{ab}(i,j)$ be the observed number of transitions $a \to b$ in the interaction pattern between nodes $i$ and $j$, and let $n_a(i,j) = \sum_b n_{ab}(i,j)$. Let $P(i,j)$ be the 2-by-2 matrix of transition probabilities for the evolution of the interaction pattern of a node pair $(i,j)$. By the law of large numbers (for stationary and ergodic random processes), the empirical transition probabilities

\[ \hat P_{ab}(i,j) = \frac{n_{ab}(i,j)}{n_a(i,j)} \tag{6.10} \]

are with high probability close to $P(i,j)$ for $T \gg 1$.

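An estimator of $P$ is obtained by averaging those probabilities over the pairs of nodes predicted to belong to the same community. More precisely, after $t$ observed snapshots ($t \ge 2$), given a predicted community assignment $\hat z^t$, we define for $a, b \in \{0,1\}$,

\[ \hat P^{t}_{ab} = \frac{1}{\big|\{(i,j) : \hat z^t_i = \hat z^t_j\}\big|} \sum_{(i,j) :\, \hat z^t_i = \hat z^t_j} \frac{n^{t}_{ab}(i,j)}{n^{t}_{a}(i,j)}, \tag{6.11} \]

where

\[ n^{t}_{ab}(i,j) = \sum_{\tau=1}^{t-1} \mathbb{1}\big(A^{\tau}_{ij} = a\big)\,\mathbb{1}\big(A^{\tau+1}_{ij} = b\big) \]

is the number of $a \to b$ transitions in the interaction pattern between nodes $i$ and $j$ (with $a, b \in \{0,1\}$) seen during the first $t$ snapshots, and

\[ n^{t}_{a}(i,j) = \sum_{b=0}^{1} n^{t}_{ab}(i,j). \]

Similarly,

\[ \hat Q^{t}_{ab} = \frac{1}{\big|\{(i,j) : \hat z^t_i \neq \hat z^t_j\}\big|} \sum_{(i,j) :\, \hat z^t_i \neq \hat z^t_j} \frac{n^{t}_{ab}(i,j)}{n^{t}_{a}(i,j)} \tag{6.12} \]

is an estimator of $Q_{ab}$. Moreover, the quantities $n^{t}_{ab}(i,j)$ can be updated inductively. Indeed,

\[ n^{t+1}_{ab}(i,j) = n^{t}_{ab}(i,j) + \mathbb{1}\big(A^{t}_{ij} = a\big)\,\mathbb{1}\big(A^{t+1}_{ij} = b\big). \tag{6.13} \]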
Finally, the initial distribution can also be estimated by averaging:

\[ \hat\mu^{t}_{1} = \frac{1}{\big|\{(i,j) : \hat z^t_i = \hat z^t_j\}\big|} \sum_{(i,j) :\, \hat z^t_i = \hat z^t_j} A^{1}_{ij}, \]

and similarly for $\hat\nu^{t}$. This leads to Algorithm 16, for clustering a Markov SBM when only the number of communities $K$ is known. Note that, to save computation time, we can choose not to update the parameters at each time step.




Algorithm 16: Online clustering for homogeneous Markov dynamics when the block interaction parameters are unknown.
Input: Observed graph sequence $X^{1:T} = (X^1, \dots, X^T)$; number of communities $K$; static graph clustering algorithm algo.
Output: Node labelling $\hat z = (\hat z_1, \dots, \hat z_n)$.
Initialize:
    • Compute $\hat z \leftarrow \mathrm{algo}(X^1)$;
    • Let $n_{ab}(i,j) \leftarrow 0$ for $i,j \in [n]$ and $a, b \in \{0,1\}$.
Update:
for $t = 2, \dots, T$ do
    For every node pair $(i,j)$, update $n_{ab}(i,j)$ using (6.13);
    Compute $\hat P, \hat Q$ using (6.11) and (6.12);
    Compute $M$ such that $M_{ij} = \sum_{a,b} n_{ab}(i,j) \log\frac{\hat P_{ab}}{\hat Q_{ab}}$;
    for $i = 1, \dots, n$ do
        Set $L_{ik} \leftarrow \sum_{j \neq i} M_{ij}\,\mathbb{1}(\hat z_j = k)$ for all $k = 1, \dots, K$
        Set $\hat z_i \leftarrow \arg\max_{1 \le k \le K} L_{ik}$
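The parameter-estimation steps (6.11)-(6.13) can be illustrated with a short NumPy sketch. The helper names `transition_counts` and `estimate_transitions` are hypothetical, and one convention is an assumption made here rather than taken from the text: pairs for which a state $a$ was never visited (so that $n_a(i,j) = 0$) are skipped in the average instead of contributing an undefined ratio.

```python
import numpy as np

def transition_counts(A):
    """counts[a, b, i, j] = number of a -> b transitions observed for pair (i, j)."""
    A = np.asarray(A, dtype=int)
    T, n, _ = A.shape
    counts = np.zeros((2, 2, n, n))
    for t in range(1, T):                        # inductive update, as in (6.13)
        for a in (0, 1):
            for b in (0, 1):
                counts[a, b] += (A[t - 1] == a) & (A[t] == b)
    return counts

def estimate_transitions(counts, z):
    """Average the empirical transition ratios over intra- and inter-block pairs."""
    z = np.asarray(z)
    n = counts.shape[-1]
    same = z[:, None] == z[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    n_a = counts.sum(axis=1, keepdims=True)      # n_a(i, j), shape (2, 1, n, n)
    with np.errstate(invalid="ignore", divide="ignore"):
        ratios = counts / n_a                    # NaN where state a never occurred
    P_hat = np.nanmean(ratios[:, :, same & off_diag], axis=-1)
    Q_hat = np.nanmean(ratios[:, :, ~same], axis=-1)
    return P_hat, Q_hat
```

In an online setting only the last pair of snapshots is needed to update `counts`, so the loop over `t` would be replaced by a single increment per time step.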


Numerical results 


Evolution of accuracy with the number of snapshots Let us first study 
numerically the effect of the initialization step. We plot in Figure 6.1 the evolution 
of the averaged accuracy obtained when we run Algorithm 15 on 50 realizations 
of a Markov SBM, where the initialization is done either using spectral cluster- 
ing or by random guessing. Obviously, when spectral clustering works well (see 
Figure 6.1(a)), it is preferable to use it rather than a random guess. Nonetheless, 
it is striking that when the initial spectral clustering has poor accuracy, the likelihood method can overcome it. For example, in Figure 6.1(b), spectral clustering on the first snapshot performs poorly (accuracy ≈ 50%, hence not much better than random guessing), yet Algorithm 15 overcomes this and reaches perfect clustering after a few snapshots. In that particular setting, there is no advantage in using spectral clustering instead of random guessing. Additionally, random guessing is faster than spectral clustering.


Unknown interaction parameters We next show in Figure 6.2 the comparison of accuracy obtained by Algorithm 15 (with known interaction parameters) and by Algorithm 16 (with unknown interaction parameters). In all the following numerical experiments, we choose sparse settings in which spectral clustering on a single snapshot does not provide more information than a blind random guess. While Algorithm 15 performs better (as expected, since it does not have to estimate the Markov chain transition probabilities), Algorithm 16 also reaches an excellent accuracy when given more snapshots.

Figure 6.1. Evolution of the averaged accuracy given by Algorithm 15 when the initialisation is done via spectral clustering or random guessing. The synthetic graphs are Markov SBM with $n = 500$ nodes (equally divided into two clusters), and with parameters $\nu_1 = 1.5\,[\ldots]$, $P_{11} = 0.7$ and $Q_{11} = 0.3$. Accuracy is averaged over 50 realisations, and the error bars represent the standard error. $T_{\mathrm{theo}}$ is the theoretical minimum number of time steps needed to get above the strong consistency threshold.

[Plot: accuracy versus number of time steps for Algorithm 15 and Algorithm 16. Panels: (a) $\mu_1 = 0.004$; (b) $\mu_1 = 0.016$.]

Figure 6.2. Comparison of accuracy obtained by online Algorithms 15 and 16 on Markov SBMs with $N = 400$, $K = 2$ and $\nu_1 = 0.004$. Results are averaged over 25 Markov SBMs and error bars show the standard errors.

Let us finally study the performance of Algorithm 16 in Markov SBMs for which the expected degree in a given temporal layer is less than 1. This corresponds to an extremely sparse regime. Nonetheless, as we see in Figure 6.3, Algorithm 16 performs well, even when $\mu_1 = \nu_1$, as long as $P_{11} \neq Q_{11}$ (see Figure 6.3(a)). This shows that Algorithm 16 recovers the communities very well, even in the most challenging regimes.




[Plot: accuracy versus number of time steps, two panels (a) and (b), one curve per legend value (0.1, 0.3, 0.6, 0.9).]

Figure 6.3. Evolution of the accuracy with the number of snapshots obtained by Algorithm 16 in an extremely sparse setting. We draw Markov SBMs with $N = 300$ nodes, two communities of the same size and parameters $\nu_1 = [\ldots]$ and $P_{11} = 0.6$. The different curves show the average over 25 Markov SBMs and the error bars correspond to the empirical standard errors.


6.2.3 Spectral Methods for Clustering Temporal Networks 


We introduced in Section 4.1 spectral methods for static graphs as relaxations of various combinatorial minimisation problems. The simplest of those problems is the min Cut, i.e.,

\[ \arg\min_{z \in [K]^n} \mathrm{Cut}(A, z), \]

where the arg min runs over all possible node labellings $z \in [K]^n$ of the node set $[n]$, and where

\[ \mathrm{Cut}(A, z) = \sum_{i<j :\, z_i \neq z_j} A_{ij}. \]

Let us now consider a temporal network represented by its list of adjacency matrices $(A^1, \dots, A^T)$. If we assume that the temporal snapshots $A^t$ are independent of each other, one could simply generalise the classical min Cut problem by considering

\[ \arg\min_{z \in [K]^n} \sum_{t=1}^{T} \mathrm{Cut}\big(A^t, z\big). \]

Since $\sum_{t=1}^{T} \mathrm{Cut}(A^t, z) = \mathrm{Cut}\big(\sum_{t=1}^{T} A^t, z\big)$, we would then apply a spectral method on the time-aggregated graph (that is, the weighted graph represented by the adjacency matrix $\sum_t A^t$).

Unfortunately, this fails to take into account the time-correlation in the interaction patterns between nodes. As an example, consider a network in which the inter-community interactions are sparse and temporally independent (hence forming spikes), while the intra-community interactions are strongly correlated in time. Consider two node pairs whose interaction patterns are given by $x_1 = (0,1,0,0,0,0,1,0,0,1)$ and $x_2 = (0,0,1,1,1,0,0,0,0,0)$. Since $\|x_1\|_1 = \|x_2\|_1 = 3$, we see that simple time-aggregation is agnostic to the different time patterns of the series $x_1$ and $x_2$, and the important information is lost.

A possible correction is to account for the persistent links. Indeed, in the above example, since $x_1$ (resp., $x_2$) has zero (resp., two) transitions $1 \to 1$, we might guess that $x_2$ comes from an interaction between nodes belonging to the same community. Formally, this can be done by considering

\[ \arg\min_{z} \; \sum_{t=1}^{T} \mathrm{Cut}\big(A^t, z\big) + \alpha \sum_{t=2}^{T} \mathrm{PerCut}\big(A^{t-1}, A^t, z\big), \]

where

\[ \mathrm{PerCut}\big(A^{t-1}, A^t, z\big) = \sum_{i<j :\, z_i \neq z_j} A^{t-1}_{ij} A^{t}_{ij} \]

counts the number of persistent links in the cut from time $t-1$ to time $t$. We further notice that

\[ \mathrm{PerCut}\big(A^{t-1}, A^t, z\big) = \mathrm{Cut}\big(A^{t-1} \odot A^t, z\big), \]

where $\odot$ denotes the matrix element-wise product. The following section justifies the intuition of considering persistent edges for clustering temporal networks.
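The two interaction patterns of the example can be checked numerically. The short sketch below (the helper names are illustrative only) computes the time-aggregated weight of each series and its number of persistent $1 \to 1$ transitions.

```python
import numpy as np

x1 = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 1])
x2 = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0])

def aggregated_weight(x):
    """Weight of the pair in the time-aggregated graph (the l1 norm of x)."""
    return int(x.sum())

def persistent_count(x):
    """Number of 1 -> 1 transitions between consecutive snapshots."""
    return int((x[:-1] * x[1:]).sum())

print(aggregated_weight(x1), aggregated_weight(x2))  # same aggregated weight
print(persistent_count(x1), persistent_count(x2))    # different persistence
```

Both series have the same aggregated weight, while only the second one contains persistent interactions, which is exactly the information the PerCut term retains.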


Degree-corrected temporal SBM with Markov edge dynamics Let us first present a degree-corrected version of the Markov SBM. A degree-corrected temporal stochastic block model with $n$ nodes, $K$ blocks and $T$ snapshots can be described by the probability distribution

\[ P(A \mid z, F, \theta) = \prod_{1 \le i < j \le n} F^{\theta_i\theta_j}_{z_i z_j}\big(A^{1:T}_{ij}\big) \tag{6.14} \]

of a symmetric adjacency tensor $A \in \{0,1\}^{n \times n \times T}$ with zero diagonal entries, where $z = (z_1, \dots, z_n)$ is a community assignment with $z_i \in [K]$ indicating the community of node $i$, $F = (F^{\theta_i\theta_j}_{k\ell})$ is a collection of probability distributions over $\{0,1\}^T$ and $\theta = (\theta_1, \dots, \theta_n)$ is a vector of node-specific degree correction parameters, with $0 < \theta_i < \infty$.

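In the following, we will restrict ourselves to homogeneous inter-block interactions with Markov edge dynamics, for which the nodes' static community labellings are sampled uniformly at random from the set $[K]^n$ of all node labellings, and

\[ F^{\theta_i\theta_j}_{z_i z_j}(x) = \begin{cases} \mu^{\theta_i\theta_j}_{x_1} \prod_{t=2}^{T} P^{\theta_i\theta_j}_{x_{t-1}, x_t} & \text{if } z_i = z_j, \\ \nu^{\theta_i\theta_j}_{x_1} \prod_{t=2}^{T} Q^{\theta_i\theta_j}_{x_{t-1}, x_t} & \text{otherwise,} \end{cases} \tag{6.15} \]

with initial distributions

\[ \mu^{\theta_i\theta_j} = \big(1 - \theta_i\theta_j\mu_1,\; \theta_i\theta_j\mu_1\big), \qquad \nu^{\theta_i\theta_j} = \big(1 - \theta_i\theta_j\nu_1,\; \theta_i\theta_j\nu_1\big), \]

and transition probability matrices

\[ P^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j P_{01} & \theta_i\theta_j P_{01} \\ 1 - P_{11} & P_{11} \end{pmatrix}, \qquad Q^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j Q_{01} & \theta_i\theta_j Q_{01} \\ 1 - Q_{11} & Q_{11} \end{pmatrix}. \]

The parameters $\theta_i$, $i = 1, \dots, n$, account for the fact that some nodes can be more prone than others to start new connections, similarly to the degree-corrected block model (Karrer and Newman, 2011). To keep the model simple, we do not add degree correction parameters in front of $P_{11}$; hence, once a connection has started, the probability of keeping it active is simply $P_{11}$ or $Q_{11}$. Moreover, we assume that $\min_{i,j}\{\theta_i\theta_j\delta\} < 1$, where $\delta = \max\{\mu_1, \nu_1, P_{01}, Q_{01}\}$. Finally, we normalise the degree correction parameters so that $\sum_i \mathbb{1}(z_i = k)\,\theta_i = \sum_i \mathbb{1}(z_i = k)$ for all $k$.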
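To make the generative model (6.14)-(6.15) concrete, here is a minimal sampling sketch in Python. It is an illustration under assumptions, not code from the source: the function name `sample_markov_sbm` and its signature are hypothetical, the normalisation of `theta` within each block is left to the caller, and the caller must make sure that all products such as `theta[i] * theta[j] * mu1` are valid probabilities.

```python
import numpy as np

def sample_markov_sbm(n, K, T, mu1, nu1, P01, Q01, P11, Q11, theta=None, seed=0):
    """Sample (A, z): A has shape (T, n, n), z holds labels in {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=n)                        # uniform static labels
    theta = np.ones(n) if theta is None else np.asarray(theta, dtype=float)
    same = z[:, None] == z[None, :]
    tt = np.outer(theta, theta)                        # theta_i * theta_j
    p_init = tt * np.where(same, mu1, nu1)             # P(A^1_ij = 1)
    p_on = tt * np.where(same, P01, Q01)               # P(0 -> 1)
    p_stay = np.where(same, P11, Q11)                  # P(1 -> 1), no degree correction
    iu = np.triu_indices(n, k=1)                       # one chain per pair i < j
    A = np.zeros((T, n, n), dtype=int)
    state = rng.random(iu[0].size) < p_init[iu]
    for t in range(T):
        if t > 0:
            prob = np.where(state, p_stay[iu], p_on[iu])
            state = rng.random(iu[0].size) < prob
        A[t][iu] = state
        A[t] = A[t] + A[t].T                           # symmetrise, zero diagonal
    return A, z
```

Each unordered pair evolves as its own two-state Markov chain, which is why only the upper-triangular entries are simulated and then mirrored.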


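Maximum likelihood estimator

Proposition 6.2 (Avrachenkov et al., 2021b). Let $\rho_a^{\theta_i\theta_j} = \log\frac{\mu_a^{\theta_i\theta_j}}{\nu_a^{\theta_i\theta_j}}$ and $\ell_{ab}^{\theta_i\theta_j} = \log\frac{P_{ab}^{\theta_i\theta_j}}{Q_{ab}^{\theta_i\theta_j}} - \log\frac{P_{00}^{\theta_i\theta_j}}{Q_{00}^{\theta_i\theta_j}}$. A maximum likelihood estimator for the degree-corrected Markov SBM defined by (6.14)-(6.15) is any community assignment $\hat z$ that maximises

\[
\frac12 \sum_{i,j :\, z_i = z_j} \Bigg\{ \Big[ A^1_{ij}\big(\rho_1^{\theta_i\theta_j} - \rho_0^{\theta_i\theta_j}\big) + \rho_0^{\theta_i\theta_j} + \big(A^1_{ij} - A^T_{ij}\big)\,\ell_{10}^{\theta_i\theta_j} \Big]
+ \sum_{t=2}^{T} \Big[ \big(\ell_{01}^{\theta_i\theta_j} + \ell_{10}^{\theta_i\theta_j}\big)\big(A^t_{ij} - A^{t-1}_{ij}A^t_{ij}\big) + \ell_{11}^{\theta_i\theta_j} A^{t-1}_{ij}A^t_{ij} - \log\frac{Q_{00}^{\theta_i\theta_j}}{P_{00}^{\theta_i\theta_j}} \Big] \Bigg\}
\]

over all community assignments $z \in [K]^n$.

Proof. By the temporal Markov property, the log-likelihood of the model can be written as $\log P(A \mid z, \theta) = \log P(A^1 \mid z, \theta) + \sum_{t=2}^{T} \log P(A^t \mid A^{t-1}, z, \theta)$. By denoting $\rho_a^{\theta_i\theta_j} = \log\frac{\mu_a^{\theta_i\theta_j}}{\nu_a^{\theta_i\theta_j}}$, we find that

\[
\log P(A^1 \mid z, \theta) = \frac12 \sum_{i,j} \sum_{a} \delta(A^1_{ij}, a)\Big(\delta(z_i, z_j)\,\rho_a^{\theta_i\theta_j} + \log \nu_a^{\theta_i\theta_j}\Big)
= \frac12 \sum_{i,j} \delta(z_i, z_j) \sum_{a} \delta(A^1_{ij}, a)\,\rho_a^{\theta_i\theta_j} + c_1(A),
\]

where $c_1(A) = \frac12 \sum_{i,j}\sum_a \delta(A^1_{ij}, a) \log \nu_a^{\theta_i\theta_j}$ does not depend on the community structure. Similarly, by denoting $R_{ab}^{\theta_i\theta_j} = \log\frac{P_{ab}^{\theta_i\theta_j}}{Q_{ab}^{\theta_i\theta_j}}$, we find that $\log P(A^t \mid A^{t-1}, z, \theta)$ is equal to

\[
\frac12 \sum_{i,j} \sum_{a,b} \delta(A^{t-1}_{ij}, a)\,\delta(A^{t}_{ij}, b)\Big(\delta(z_i, z_j)\,R_{ab}^{\theta_i\theta_j} + \log Q_{ab}^{\theta_i\theta_j}\Big)
= \frac12 \sum_{i,j} \delta(z_i, z_j) \sum_{a,b} \delta(A^{t-1}_{ij}, a)\,\delta(A^{t}_{ij}, b)\,R_{ab}^{\theta_i\theta_j} + c_t(A),
\]

where $c_t(A) = \frac12 \sum_{i,j}\sum_{a,b} \delta(A^{t-1}_{ij}, a)\,\delta(A^{t}_{ij}, b) \log Q_{ab}^{\theta_i\theta_j}$ does not depend on the community structure. Simple calculations show that

\[ \sum_a \delta(A^1_{ij}, a)\,\rho_a^{\theta_i\theta_j} = A^1_{ij}\big(\rho_1^{\theta_i\theta_j} - \rho_0^{\theta_i\theta_j}\big) + \rho_0^{\theta_i\theta_j} \]

and that $\sum_{a,b} \delta(A^{t-1}_{ij}, a)\,\delta(A^{t}_{ij}, b)\,R_{ab}^{\theta_i\theta_j}$ is equal to

\[
\begin{aligned}
& R_{00}^{\theta_i\theta_j} + A^{t-1}_{ij}\big(R_{10}^{\theta_i\theta_j} - R_{00}^{\theta_i\theta_j}\big) + A^{t}_{ij}\big(R_{01}^{\theta_i\theta_j} - R_{00}^{\theta_i\theta_j}\big) + A^{t-1}_{ij}A^{t}_{ij}\big(R_{11}^{\theta_i\theta_j} - R_{01}^{\theta_i\theta_j} - R_{10}^{\theta_i\theta_j} + R_{00}^{\theta_i\theta_j}\big) \\
&\quad = R_{00}^{\theta_i\theta_j} + A^{t-1}_{ij}\,\ell_{10}^{\theta_i\theta_j} + A^{t}_{ij}\,\ell_{01}^{\theta_i\theta_j} + A^{t-1}_{ij}A^{t}_{ij}\big(\ell_{11}^{\theta_i\theta_j} - \ell_{01}^{\theta_i\theta_j} - \ell_{10}^{\theta_i\theta_j}\big).
\end{aligned}
\]

By collecting the above observations, we find that $\log P(A \mid z, \theta)$ is equal to

\[
\frac12 \sum_{i,j :\, z_i = z_j} \Bigg\{ A^1_{ij}\big(\rho_1^{\theta_i\theta_j} - \rho_0^{\theta_i\theta_j}\big) + \rho_0^{\theta_i\theta_j} + \sum_{t=2}^{T} \Big[ A^{t-1}_{ij}\,\ell_{10}^{\theta_i\theta_j} + A^{t}_{ij}\,\ell_{01}^{\theta_i\theta_j} + A^{t-1}_{ij}A^{t}_{ij}\big(\ell_{11}^{\theta_i\theta_j} - \ell_{01}^{\theta_i\theta_j} - \ell_{10}^{\theta_i\theta_j}\big) - \log\frac{Q_{00}^{\theta_i\theta_j}}{P_{00}^{\theta_i\theta_j}} \Big] \Bigg\} + c(A),
\]

where $c(A) = \sum_t c_t(A)$ does not depend on $z$. Since $\sum_{t=2}^{T} A^{t-1}_{ij} = \sum_{t=2}^{T} A^{t}_{ij} + A^1_{ij} - A^T_{ij}$, regrouping the terms gives the stated expression. Hence the claim follows.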


The MLE derived in Proposition 6.2 is more complex than that obtained by summing all snapshots independently. In particular, the terms $A^{t-1}_{ij}A^{t}_{ij}$ account for persistent edges over two consecutive snapshots. Denote by $A^t_{\mathrm{pers}} = A^{t-1} \odot A^t$ the entrywise product of the adjacency matrices $A^{t-1}$ and $A^t$. Then $A^t_{\mathrm{pers}}$ is the adjacency matrix of the graph containing the persistent edges between $t-1$ and $t$, and $A^t_{\mathrm{new}} = A^t - A^t_{\mathrm{pers}}$ corresponds to the graph containing the freshly appearing edges between time $t-1$ and time $t$.

Assuming that the number of snapshots $T$ is large, we can ignore the boundary terms, and the MLE expressed in Proposition 6.2 reduces to maximising

\[
\sum_{t=2}^{T} \sum_{i,j :\, z_i = z_j} \Big[ \big(\ell_{01}^{\theta_i\theta_j} + \ell_{10}^{\theta_i\theta_j}\big)\big(A^t_{ij} - A^{t-1}_{ij}A^t_{ij}\big) + \ell_{11}^{\theta_i\theta_j} A^{t-1}_{ij}A^t_{ij} - \log\frac{Q_{00}^{\theta_i\theta_j}}{P_{00}^{\theta_i\theta_j}} \Big].
\]

This expression can be further simplified and expressed as a regularised modularity. Recall that, given a weighted graph $W$, a partition $z$ and a resolution parameter $\gamma$, the regularised modularity is defined as (see Section 4.2 and equation (4.21))

\[ \mathcal{M}(W, z, \gamma) = \sum_{i,j} \delta(z_i, z_j)\Big(W_{ij} - \gamma\,\frac{d_i d_j}{2m}\Big), \]

where $d_i = \sum_j W_{ij}$ and $m = \frac12 \sum_i d_i$.
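As a small illustration, the regularised modularity can be evaluated directly from a weight matrix. The sketch below assumes the standard form in which the null-model term is $\gamma\, d_i d_j / \sum_k d_k$ and the sum runs over all ordered pairs $(i,j)$ with $z_i = z_j$; the function name is hypothetical, and the normalisation should be checked against equation (4.21) before reuse.

```python
import numpy as np

def regularised_modularity(W, z, gamma):
    """Sum over pairs in the same block of W_ij - gamma * d_i d_j / (2m)."""
    W = np.asarray(W, dtype=float)
    z = np.asarray(z)
    d = W.sum(axis=1)                      # weighted degrees
    two_m = d.sum()                        # 2m = total degree
    same = z[:, None] == z[None, :]        # indicator delta(z_i, z_j)
    return float(((W - gamma * np.outer(d, d) / two_m) * same).sum())
```

With $\gamma = 1$ this is the usual (unnormalised) modularity; larger $\gamma$ penalises large blocks more strongly.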


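Lemma 6.1. Suppose that $P^{\theta_i\theta_j}$ and $Q^{\theta_i\theta_j}$ are non-degenerate, and $\mu^{\theta_i\theta_j}$ (resp., $\nu^{\theta_i\theta_j}$) is the stationary distribution of $P^{\theta_i\theta_j}$ (resp., $Q^{\theta_i\theta_j}$). In a sparse setting, where $P_{01}$ and $Q_{01}$ are small, the MLE approximately maximises $\mathcal{M}(W, z, \gamma)$, where $W$ is defined by

\[ W = \sum_{t=2}^{T} \big(\alpha A^t_{\mathrm{new}} + \beta A^t_{\mathrm{pers}}\big), \tag{6.16} \]

with

\[ \alpha = \log\frac{P_{01}}{Q_{01}} + \log\frac{1 - P_{11}}{1 - Q_{11}}, \quad\text{and}\quad \beta = \log\frac{P_{11}}{Q_{11}}, \tag{6.17} \]

and

\[ \gamma = \frac{K\,(P_{01} - Q_{01})}{\mu_1\big(\alpha(1 - P_{11}) + \beta P_{11}\big) + (K-1)\,\nu_1\big(\alpha(1 - Q_{11}) + \beta Q_{11}\big)}. \]

Proof. Because $P_{01}, Q_{01} = o(1)$, a first-order Taylor expansion yields

\[ \log\frac{1 - \theta_i\theta_j Q_{01}}{1 - \theta_i\theta_j P_{01}} = \theta_i\theta_j\,(P_{01} - Q_{01}) + o(P_{01} + Q_{01}), \]

as well as $\ell_{01}^{\theta_i\theta_j} \approx \log\frac{P_{01}}{Q_{01}}$, $\ell_{10}^{\theta_i\theta_j} \approx \log\frac{1 - P_{11}}{1 - Q_{11}}$ and $\ell_{11}^{\theta_i\theta_j} \approx \log\frac{P_{11}}{Q_{11}}$. Using these approximations in the MLE expression leads to maximising

\[ \sum_{t=2}^{T} \sum_{i,j} \delta(z_i, z_j)\Big(a^t_{ij} - \theta_i\theta_j\,(P_{01} - Q_{01})\Big), \tag{6.18} \]

where $a^t_{ij} = \alpha\,(A^t_{\mathrm{new}})_{ij} + \beta\,(A^t_{\mathrm{pers}})_{ij}$. Since $\mu$ and $\nu$ are stationary distributions,

\[ \mathbb{E}\,(A^t_{\mathrm{new}})_{ij} = \begin{cases} \theta_i\theta_j\,\mu_1(1 - P_{11}) & \text{if } z_i = z_j, \\ \theta_i\theta_j\,\nu_1(1 - Q_{11}) & \text{otherwise,} \end{cases} \qquad \mathbb{E}\,(A^t_{\mathrm{pers}})_{ij} = \begin{cases} \theta_i\theta_j\,\mu_1 P_{11} & \text{if } z_i = z_j, \\ \theta_i\theta_j\,\nu_1 Q_{11} & \text{otherwise.} \end{cases} \]

Therefore, using $W_{ij} = \sum_{t=2}^{T} a^t_{ij}$, we obtain

\[ \mathbb{E}\,W_{ij} = \begin{cases} (T-1)\,\theta_i\theta_j\,\mu_1\big(\alpha(1 - P_{11}) + \beta P_{11}\big) & \text{if } z_i = z_j, \\ (T-1)\,\theta_i\theta_j\,\nu_1\big(\alpha(1 - Q_{11}) + \beta Q_{11}\big) & \text{otherwise.} \end{cases} \]

Since the community labelling is sampled uniformly at random, and the $\theta_i$'s are properly normalised, the expected degree $d_i$ is equal to

\[ (T-1)\,\theta_i\,\frac{n}{K}\Big[\mu_1\big(\alpha(1 - P_{11}) + \beta P_{11}\big) + (K-1)\,\nu_1\big(\alpha(1 - Q_{11}) + \beta Q_{11}\big)\Big]. \]

Hence, together with

\[ m = \frac12\sum_i d_i = (T-1)\,\frac{n^2}{2K}\Big[\mu_1\big(\alpha(1 - P_{11}) + \beta P_{11}\big) + (K-1)\,\nu_1\big(\alpha(1 - Q_{11}) + \beta Q_{11}\big)\Big], \]

we observe that $(T-1)\,\theta_i\theta_j\,(P_{01} - Q_{01}) = \gamma\,\frac{d_i d_j}{2m}$, where $\gamma$ is as given in the statement. We end the proof using equation (6.18).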


Temporal spectral clustering combining new and persistent edges Following our analysis of the previous section, the MLE is approximately given by the solution of

\[ \arg\max_{z \in [K]^n} \mathcal{M}(W, z, \gamma), \]

where $W$ is defined in Equation (6.16) and $\gamma$ is an appropriate resolution parameter. This optimisation problem is NP-hard in general (Brandes et al., 2007), but can be approximately solved by continuous relaxation. We can choose the relaxation so that the optimisation problem reduces to the normalised spectral clustering algorithm on the weighted graph $W$ (see Section 4.4.2). We note that in order to compute the normalised Laplacian of $W$, we should restrict $\alpha, \beta > 0$, which is not necessarily guaranteed by formula (6.17). We summarise this in Algorithm 17.


Algorithm 17: Spectral clustering for temporal networks with Markov edge dynamics and static node labelling.
Input: adjacency matrices $A^1, \dots, A^T$, number of clusters $K$, parameters $\alpha, \beta$.
Output: predicted community labels $\hat z \in [K]^n$.
Process:
    • let $W = \sum_{t=2}^{T} \big(\alpha A^t_{\mathrm{new}} + \beta A^t_{\mathrm{pers}}\big)$, where $A^t_{\mathrm{new}} = A^t - A^{t-1} \odot A^t$ and $A^t_{\mathrm{pers}} = A^{t-1} \odot A^t$;
    • compute $\mathcal{L} = I_n - D^{-1/2} W D^{-1/2}$ where $D = \mathrm{diag}(W 1_n)$;
    • compute $X \in \mathbb{R}^{n \times K}$ whose columns consist of the $K$ orthonormal eigenvectors of $\mathcal{L}$ associated to the $K$ smallest eigenvalues.
Return: $\hat z \leftarrow k\text{-means}\big(D^{-1/2}X, K\big)$.
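A compact NumPy sketch of these steps is given below. It is an illustration under assumptions rather than a reference implementation: the function names are hypothetical, the $k$-means step is a plain Lloyd iteration with a deterministic farthest-point initialisation (any standard $k$-means routine could be substituted), the rows passed to $k$-means are those of $D^{-1/2}X$, and $\alpha, \beta$ are assumed nonnegative.

```python
import numpy as np

def temporal_weights(A, alpha, beta):
    """W = sum_{t>=2} alpha * A_new^t + beta * A_pers^t for A of shape (T, n, n)."""
    A = np.asarray(A, dtype=float)
    pers = A[:-1] * A[1:]                  # persistent edges between t-1 and t
    new = A[1:] - pers                     # edges present at t but not at t-1
    return alpha * new.sum(axis=0) + beta * pers.sum(axis=0)

def kmeans(X, K, n_iter=100):
    """Lloyd iterations with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

def temporal_spectral_clustering(A, K, alpha, beta):
    W = temporal_weights(A, alpha, beta)
    deg = W.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)            # eigenvalues returned in ascending order
    X = vecs[:, :K]                        # K eigenvectors of smallest eigenvalues
    return kmeans(inv_sqrt[:, None] * X, K)
```

Setting `alpha = beta = 1` recovers spectral clustering on the graph aggregated over snapshots $2, \dots, T$, which gives a convenient baseline for comparison.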


Numerical results 


Synthetic data We first examine the effect of the choice of the parameters $\alpha$ and $\beta$ in Algorithm 17. For this, we let $\alpha = 1$ and we plot in Figure 6.4 the averaged accuracy obtained on 25 realizations of stochastic block models with Markov edge dynamics for various $\beta$. While spectral clustering on the time-aggregated graph (corresponding to $\beta = 1$) works well, it is striking that other values of $\beta$ give even better results. The choice of $\beta$ depends on the probabilities of persistent interactions. For example, if $P_{11} > Q_{11}$ (Figure 6.4(a)), then $\beta > 1$ is preferred, while if $P_{11} < Q_{11}$ (Figure 6.4(b)), large values of $\beta$ are penalised. This is in accordance with the recommended values of $\alpha$ and $\beta$ derived in formula (6.17).

Figure 6.4. Accuracy of Algorithm 17 on a temporal SBM with 300 nodes, $K = 3$ blocks, and a stationary Markov edge evolution with $\mu_1 = 0.04$, $\nu_1 = 0.02$ and $Q_{11} = 0.3$. The results are averaged over 25 synthetic graph realizations, and error bars show the standard deviation.


Social networks of high school students We investigate three data sets col- 
lected during three consecutive years from a high school Lycée Thiers in Marseilles, 
France (Fournet and Barrat, 2014; Mastrandrea et al., 2015). We presented these 
data sets in the introduction. In particular, nodes correspond to students, interac- 
tions to close-proximity encounters, and communities to classes, with dimensions 
given in Table 1.1. 

We hypothesise that the temporal characteristics of the interactions are similar each year. We then use the 2011 data set to estimate the transition probability matrices $P$ and $Q$, and use these for clustering the 2012 and 2013 data sets. We assume that $\theta_i = 1$ (no degree correction). A standard estimator of Markov chain transition probability matrices (Billingsley, 1961) gives

\[ \hat P = \begin{pmatrix} 0.9992 & 0.0008 \\ 0.37 & 0.63 \end{pmatrix} \quad\text{and}\quad \hat Q = \begin{pmatrix} 0.999967 & 3.3 \times 10^{-5} \\ 0.48 & 0.52 \end{pmatrix}. \]

Using (6.17) leads to $\alpha = 2.9$ and $\beta = 0.18$. We observe in Figure 6.5(b) that this choice of parameters gives a better accuracy on the 2013 data set than simply applying spectral clustering on the time-aggregated graph ($\alpha = \beta = 1$). For the 2012 data set (Figure 6.5(a)), this improvement is not so clearly visible.
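The weights of formula (6.17) are straightforward to evaluate. The following sketch (the function name is illustrative) applies the formula to the rounded 2011 estimates printed above; because the inputs are rounded, the result only approximately matches the reported values, in particular for $\beta$.

```python
import math

def clustering_weights(P01, Q01, P11, Q11):
    """Weights alpha and beta of formula (6.17)."""
    alpha = math.log(P01 / Q01) + math.log((1 - P11) / (1 - Q11))
    beta = math.log(P11 / Q11)
    return alpha, beta

alpha, beta = clustering_weights(P01=0.0008, Q01=3.3e-5, P11=0.63, Q11=0.52)
print(round(alpha, 2), round(beta, 2))
```

Note that $\beta$ changes sign when $P_{11} < Q_{11}$, in which case the weights would have to be adjusted before forming a normalised Laplacian.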

To understand why Algorithm 17 performs better for 2013 than for 2012, we have listed in Table 6.1 the temporal transition probabilities and clustering weights $\alpha, \beta$ estimated separately for each data set. For year 2012, the difference between the intra-community edge persistence $P_{11}$ and the inter-community edge persistence $Q_{11}$ is small, implying that persistent edges do not add much extra information for distinguishing communities ($\beta \approx 0$). For years 2011 and 2013, this difference is larger, indicating that edge persistence contains information that can be employed to recover communities with a higher accuracy.

Figure 6.5. Accuracy of Algorithm 17 on the 2012 and 2013 high school data sets, using uniform $\alpha = \beta = 1$ (blue) and adjusted $\alpha, \beta$, whose values are predicted using 2011 data (orange).

Table 6.1. Markov chain transition probabilities and adjusted clustering weights estimated separately for each data set.

Dataset   P_01      Q_01       P_11   Q_11   α     β      β/α
2011      0.00080   0.000033   0.63   0.52   2.9   0.18   0.060
2012      0.00050   0.000011   0.57   0.56   3.8   0.01   0.003
2013      0.00150   0.000014   0.64   0.40   4.5   0.07   0.015


6.2.4 Clustering for Long Time Horizon Using Empirical 
Transition Rates 


We continue to study the temporal SBM with static memberships and homogeneous Markov interaction kernels, as defined in Example 6.2. We denote by $P, Q$ the transition probability matrices. Let us consider the situation when the number of snapshots $T$ goes to infinity while the number of nodes $N$ remains bounded. The main idea is to use the ergodicity of the Markov chains to estimate the parameters using standard techniques, and then perform inference. For now, we will assume that the interaction parameters $P, Q$ are known, but $K$ is unknown. We refer to Remark 6.3 for the case when $P, Q$ are unknown as well.

Recall that formula (6.10) gave consistent estimators for $P(i,j)$, the matrix of transition probabilities for the evolution of the interaction pattern between a node pair $(i,j)$. Then, once all $P(i,j)$ are known with good precision, we can use our knowledge of $P, Q$ to distinguish whether nodes $i$ and $j$ are in the same block or not, and use this data to construct a similarity graph on the set of nodes. This leads to Algorithm 18, which does not require a priori knowledge of the number of blocks, but instead estimates it as a by-product. Note that this algorithm is tailor-made for homogeneous interaction arrays.


Algorithm 18: Clustering by empirical transition rates.
Input: observed interaction tensor $(A^t_{ij})$; transition probability matrices $P, Q$.
Output: estimated node labelling $\hat z = (\hat z_1, \dots, \hat z_n)$; estimated number of communities $\hat K$.
Set $V \leftarrow \{1, \dots, n\}$ and $E \leftarrow \emptyset$.
for all unordered node pairs $ij$ do
    compute $\hat P_{ab}(i,j)$ for $a, b = 0, 1$ using (6.10).
    if $|\hat P_{ab}(i,j) - P_{ab}| < \frac12 |P_{ab} - Q_{ab}|$ for some $a, b$ then
        set $E \leftarrow E \cup \{ij\}$.
Compute $C \leftarrow$ set of connected components in $G = (V, E)$ and set $\hat K \leftarrow |C|$ and $(C_1, \dots, C_{\hat K}) \leftarrow$ members of $C$ listed in arbitrary order.
for $i = 1, \dots, n$ do
    $\hat z_i \leftarrow$ unique $k$ for which $C_k \ni i$.
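The procedure can be sketched in Python as follows. This is an illustrative sketch with hypothetical names: the empirical transition matrices are computed for all pairs at once, pairs for which a state was never visited yield undefined ratios that simply fail the closeness test, and the connected components are found by a depth-first search.

```python
import numpy as np

def cluster_by_transition_rates(A, P, Q):
    """Return (labels, K_hat) from a (T, n, n) 0/1 tensor and known P, Q."""
    A = np.asarray(A, dtype=int)
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = A.shape[1]
    counts = np.zeros((2, 2, n, n))
    for a in (0, 1):
        for b in (0, 1):
            counts[a, b] = ((A[:-1] == a) & (A[1:] == b)).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        P_hat = counts / counts.sum(axis=1, keepdims=True)    # empirical rates (6.10)
        close = (np.abs(P_hat - P[:, :, None, None])
                 < 0.5 * np.abs(P - Q)[:, :, None, None])
    linked = close.any(axis=(0, 1))                           # "for some a, b"
    np.fill_diagonal(linked, False)
    labels = -np.ones(n, dtype=int)
    k = 0
    for s in range(n):                                        # connected components
        if labels[s] >= 0:
            continue
        labels[s] = k
        stack = [s]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(linked[u]):
                if labels[v] < 0:
                    labels[v] = k
                    stack.append(v)
        k += 1
    return labels, k
```

Because labels are assigned in order of first appearance, the output is determined only up to a permutation of the community indices.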


Theorem 6.2. Consider a homogeneous Markov SBM with $n$ nodes, $K$ communities and $T$ snapshots. Assume that $n$ is fixed, and the transition probability matrices $P, Q$ are known. Then with high probability Algorithm 18 correctly classifies every node when $T$ goes to infinity, as long as the evolution is not static and $P \neq Q$.

Proof. For $a, b \in \{0,1\}$, let $n_a(i,j) = \sum_b n_{ab}(i,j)$, where $n_{ab}(i,j)$ counts the observed number of transitions $a \to b$ between a node pair $(i,j)$. The distribution of the random variable

\[ \xi_{ab}(i,j) = \frac{n_{ab}(i,j) - n_a(i,j)\,P_{ab}(i,j)}{\sqrt{n_a(i,j)}} \]

tends to a normal distribution with zero mean and finite variance given by $\lambda_{(ab),(cd)} = \delta_{ac}\big(\delta_{bd}\,P_{ab}(i,j) - P_{ab}(i,j)\,P_{ad}(i,j)\big)$ (see Billingsley, 1961, Theorem 3.1 and formula (3.13)). Therefore, for any $\alpha > 0$,

\[ P\Big(\big|\hat P_{ab}(i,j) - P_{ab}(i,j)\big| \ge \alpha\Big) = P\Big(\big|\xi_{ab}(i,j)\big| \ge \alpha\sqrt{n_a(i,j)}\Big), \tag{6.19} \]

and this quantity goes to zero as $T$ goes to infinity.

From model identifiability, $P \neq Q$. Therefore, without loss of generality, we can assume $P_{01} \neq Q_{01}$ (say $P_{01} > Q_{01}$), and choose $\alpha$ such that $0 < \alpha < \frac{|P_{01} - Q_{01}|}{2}$. The nodes $i$ and $j$ are predicted to be in the same community if $\hat P_{01}(i,j) > \frac{P_{01} + Q_{01}}{2}$, and the probability of making an error is at most

\[ P\Big(\big|\hat P_{01}(i,j) - P_{01}(i,j)\big| \ge \alpha\Big). \]

By the union bound, the probability that some node pair is misclassified is bounded by

\[ \frac{n(n-1)}{2}\, \max_{ij}\, P\Big(\big|\hat P_{01}(i,j) - P_{01}(i,j)\big| \ge \alpha\Big), \]

where the maximum is taken over all node pairs $ij$. By equation (6.19), for all node pairs $ij$, we have $P\big(|\hat P_{01}(i,j) - P_{01}(i,j)| \ge \alpha\big) \to 0$. Therefore, all nodes are correctly classified with probability tending to one as $T \to \infty$.


Remark 6.3. If $P$ and $Q$ are unknown, we can add a step where the estimated transition matrices $\hat P(i,j)$ are clustered into two classes (for example using $k$-means).


6.3 Markovian Evolution of Community 
Memberships 


This section focuses on clustering temporal networks whose membership structure follows a Markov chain, but whose interaction structure is time independent. Specifically, we denote by $z_{it} \in [K]$ the group membership of node $i$ at time $t$. Then, across nodes, the random variables $(z_{i\cdot})_{1 \le i \le n}$ are independent and identically distributed. For each node $i$, the group membership $z_{i\cdot} = (z_{i1}, \dots, z_{iT})$ follows an irreducible and aperiodic Markov chain, given by

\[ P(z_{i\cdot}) = \alpha_{z_{i1}} \prod_{t=2}^{T} \pi_{z_{i,t-1}\, z_{it}}, \tag{6.20} \]

where $\alpha$ is the initial distribution and $\pi$ is the transition probability matrix. Conditioned on the node labels, the edges are independent, and for all $i < j$ and all $t$, we have

\[ A^t_{ij} \mid z_{it}, z_{jt} \sim \mathrm{Ber}\big(p_{z_{it} z_{jt}}\big). \]

The likelihood of the sequence of adjacency matrices $A^{1:T} = (A^1, \dots, A^T)$ is therefore

\[ P\big(A^{1:T}, Z\big) = \prod_{i=1}^{n} \Big[\alpha_{z_{i1}} \prod_{t=2}^{T} \pi_{z_{i,t-1}\, z_{it}}\Big] \prod_{t=1}^{T} \prod_{i<j} p_{z_{it} z_{jt}}^{A^t_{ij}} \big(1 - p_{z_{it} z_{jt}}\big)^{1 - A^t_{ij}}. \]


6.3.1 Variational Expectation-Maximization Algorithm 


Let us first assume that $K$ is known, and that $\alpha$ is the stationary distribution of $\pi$. We aim at estimating the group memberships $Z = (z_{it})_{1 \le i \le n,\, 1 \le t \le T}$ as well as the model parameters $\theta = (\pi, p)$, where $p = (p_{k\ell})_{1 \le k, \ell \le K}$.

While the global maximisation of the likelihood is intractable when $n$ or $T$ are large, the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) provides a way to find local maxima. The EM algorithm computes the conditional distribution of $Z$ given the observation $A^{1:T}$. However, in our case this distribution does not factor into a product over the $n$ nodes because of dependencies. Indeed, we have

\[ P\big(Z \mid A^{1:T}\big) = P\big(z_{\cdot 1} \mid A^{1}\big) \prod_{t=2}^{T} P\big(z_{\cdot t} \mid z_{\cdot, t-1}, A^{t}\big), \]

where $z_{\cdot t} = (z_{1t}, \dots, z_{nt})$ denotes the community labels at time $t$. Unfortunately, the distribution $P(z_{\cdot t} \mid z_{\cdot, t-1}, A^t)$ cannot be further factored, as the random variables $z_{it} \mid A^t_{ij}$ and $z_{jt} \mid A^t_{ij}$ are not independent. Indeed, by observing an edge between $i$ and $j$ at time $t$, the likelihood that $z_{it} = z_{jt}$ increases. The variational approximation introduces a class of probability distributions $Q$ such that

\[ Q(Z) = \prod_{i=1}^{n} Q(z_{i\cdot}) = \prod_{i=1}^{n} Q(z_{i1}) \prod_{t=2}^{T} Q\big(z_{it} \mid z_{i,t-1}\big). \]

We introduce $\tau(i,k) = Q(z_{i1} = k)$ and $\tau(t,i,k,\ell) = Q(z_{it} = \ell \mid z_{i,t-1} = k)$. Thus, under $Q$, the distribution of $(z_{i1}, \dots, z_{iT})$ is a time-inhomogeneous Markov chain with transitions $\tau(t,i,k,\ell)$ and initial distribution $\tau(i,\cdot)$. In particular, $\sum_k \tau(i,k) = 1$ and $\sum_\ell \tau(t,i,k,\ell) = 1$, and

\[ Q(Z) = \prod_{i=1}^{n} \Big[\prod_{k=1}^{K} \tau(i,k)^{\mathbb{1}(z_{i1} = k)}\Big] \prod_{t=2}^{T} \prod_{1 \le k, \ell \le K} \tau(t,i,k,\ell)^{\mathbb{1}(z_{i,t-1} = k,\, z_{it} = \ell)}. \]

The marginal distributions $\tau_{\mathrm{marg}}(t,i,k) = Q(z_{it} = k)$ are computed recursively by

\[ \tau_{\mathrm{marg}}(1,i,k) = \tau(i,k), \qquad \tau_{\mathrm{marg}}(t,i,k) = \sum_{\ell=1}^{K} \tau_{\mathrm{marg}}(t-1,i,\ell)\,\tau(t,i,\ell,k). \]
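The recursion for the variational marginals is a forward pass that can be written in a few lines of NumPy. The array layout below is an assumption made for this sketch: `tau_init[i, k]` stores $\tau(i,k)$ and `tau_trans[t - 1, i, k, l]` stores the transition from time $t$ to time $t+1$ (in the one-based indexing of the text); the function name is hypothetical.

```python
import numpy as np

def variational_marginals(tau_init, tau_trans):
    """Forward recursion for tau_marg; returns an array of shape (T, n, K)."""
    tau_init = np.asarray(tau_init, dtype=float)
    tau_trans = np.asarray(tau_trans, dtype=float)
    T = tau_trans.shape[0] + 1
    n, K = tau_init.shape
    marg = np.empty((T, n, K))
    marg[0] = tau_init
    for t in range(1, T):
        # marg[t, i, l] = sum_k marg[t-1, i, k] * tau_trans[t-1, i, k, l]
        marg[t] = np.einsum("ik,ikl->il", marg[t - 1], tau_trans[t - 1])
    return marg
```

Each node is propagated independently, so the cost of the pass is $O(nTK^2)$.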


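The Variational Expectation-Maximization (VEM) algorithm (Matias and Miele, 2017) then seeks to maximise

\[ J(\theta, \tau) = \mathbb{E}_Q\big(\log P(A^{1:T}, Z)\big) + H(Q), \]

where $H(Q)$ denotes the entropy of $Q$. Hence, $J(\theta, \tau)$ is equal to

\[
\begin{aligned}
& \sum_{i=1}^{n} \sum_{k=1}^{K} \tau(i,k)\big[\log \alpha_k - \log \tau(i,k)\big] \\
& + \sum_{t=2}^{T} \sum_{i=1}^{n} \sum_{1 \le k, \ell \le K} \tau_{\mathrm{marg}}(t-1,i,k)\,\tau(t,i,k,\ell)\big[\log \pi_{k\ell} - \log \tau(t,i,k,\ell)\big] \\
& + \sum_{t=1}^{T} \sum_{1 \le i < j \le n} \sum_{1 \le k, \ell \le K} \tau_{\mathrm{marg}}(t,i,k)\,\tau_{\mathrm{marg}}(t,j,\ell)\,\log\big(\mathrm{Ber}(p_{k\ell})(A^t_{ij})\big),
\end{aligned}
\]

with

\[ \mathrm{Ber}(p_{k\ell})(A^t_{ij}) = \begin{cases} p_{k\ell} & \text{if } A^t_{ij} = 1, \\ 1 - p_{k\ell} & \text{otherwise.} \end{cases} \]

The optimisation is done iteratively. At step $k$, with current estimates $(\tau^k, \theta^k)$, we perform the two following sub-steps:
1. VE-step: compute $\tau^{k+1} = \arg\max_{\tau} J(\theta^k, \tau)$;
2. M-step: compute $\theta^{k+1} = \arg\max_{\theta} J(\theta, \tau^{k+1})$.
The following lemma provides the value of the updates $\tau^{k+1}$ and $\theta^{k+1}$.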


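Lemma 6.3. The value $\hat\tau = \arg\max_{\tau} J(\theta, \tau)$ verifies

\[ \hat\tau(t,i,k,\ell) \propto \pi_{k\ell} \prod_{j=1}^{n} \prod_{\ell'=1}^{K} \Big(\mathrm{Ber}(p_{\ell\ell'})(A^t_{ij})\Big)^{\tau_{\mathrm{marg}}(t,j,\ell')}, \]

where the proportionality ensures the normalisation constraints on $\tau$. Similarly, $\hat\theta = \arg\max_{\theta} J(\theta, \tau)$ is given by $\hat\theta = (\hat\pi, \hat p)$ such that

\[ \hat\pi_{k\ell} \propto \sum_{t=2}^{T} \sum_{i=1}^{n} \tau_{\mathrm{marg}}(t-1,i,k)\,\tau(t,i,k,\ell), \]

\[ \hat p_{k\ell} = \frac{\sum_{t=1}^{T} \sum_{1 \le i < j \le n} \tau_{\mathrm{marg}}(t,i,k)\,\tau_{\mathrm{marg}}(t,j,\ell)\,\mathbb{1}\big(A^t_{ij} \neq 0\big)}{\sum_{t=1}^{T} \sum_{1 \le i < j \le n} \tau_{\mathrm{marg}}(t,i,k)\,\tau_{\mathrm{marg}}(t,j,\ell)}. \]

Proof. The proof follows from a direct differentiation of $J(\theta, \tau)$. For example, we have

\[
\frac{\partial J}{\partial \tau(t,i,k,\ell)} = \tau_{\mathrm{marg}}(t-1,i,k)\big[\log \pi_{k\ell} - \log \tau(t,i,k,\ell) - 1\big]
+ \tau_{\mathrm{marg}}(t-1,i,k) \sum_{j} \sum_{\ell'} \tau_{\mathrm{marg}}(t,j,\ell')\,\log\big(\mathrm{Ber}(p_{\ell\ell'})(A^t_{ij})\big),
\]

and equating this derivative to zero (up to the normalisation constraint) leads to the stated expression for $\hat\tau(t,i,k,\ell)$.

Finally, $\alpha$ is obtained by computing the empirical mean of the distribution $\tau_{\mathrm{marg}}$ over all data points, i.e.,

\[ \forall k \in [K] : \quad \hat\alpha_k = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \tau_{\mathrm{marg}}(t,i,k). \]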
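The M-step updates can be illustrated with NumPy's `einsum`. The sketch below is an assumption-laden illustration rather than the authors' code: the function name and array layouts are chosen here (`marg[t, i, k]` for $\tau_{\mathrm{marg}}$, `tau_trans[t - 1, i, k, l]` for the variational transitions), the pair sums run over $i < j$, and the denominators are assumed to be strictly positive.

```python
import numpy as np

def m_step(A, marg, tau_trans):
    """Closed-form M-step: returns (pi_hat, p_hat, alpha_hat).

    A         : (T, n, n) 0/1 snapshots
    marg      : (T, n, K) variational marginals
    tau_trans : (T-1, n, K, K) variational transitions
    """
    A = np.asarray(A, dtype=float)
    T, n, K = marg.shape
    # pi_hat[k, l] proportional to sum_t sum_i marg[t-1, i, k] * tau[t, i, k, l]
    pi_hat = np.einsum("tik,tikl->kl", marg[:-1], tau_trans)
    pi_hat = pi_hat / pi_hat.sum(axis=1, keepdims=True)
    upper = np.triu(np.ones((n, n)), k=1)            # restrict to pairs i < j
    num = np.einsum("tik,tjl,tij,ij->kl", marg, marg, A, upper)
    den = np.einsum("tik,tjl,ij->kl", marg, marg, upper)
    p_hat = num / den
    alpha_hat = marg.mean(axis=(0, 1))               # average over nodes and times
    return pi_hat, p_hat, alpha_hat
```

In a full VEM loop this function would alternate with the VE-step until the objective $J$ stops increasing.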


6.3.2 Belief Propagation Using the Space-time Graph 


While the maximum likelihood estimator finds the membership structure that maximises the likelihood by solving $\arg\max_Z P(A^{1:T} \mid Z)$, we will here instead try to find, for every node, the block assignment that maximises its marginal likelihood. More precisely, the marginal likelihood $\psi^i_k(t)$ is the probability that node $i$ belongs at time $t$ to block $k$ according to the posterior distribution, and is given by

\[ \psi^i_k(t) = P\big(z_{it} = k \mid A^{1:T}\big). \]

Then, for every time $t$ we assign node $i$ to the block $\hat z_{it}$ such that

\[ \hat z_{it} = \arg\max_{k \in [K]} \psi^i_k(t). \]

To compute the marginal, we model as if, at time $t$, a node $i$ sends to each of its neighbours $j$ a message $\psi^{i \to j}_k(t)$, which is an estimate of the probability that $i$ is in community $k$ if node $j$ was not here.

Since the graph is temporal, we also have to take into account the temporal evolution. More precisely, at time $t$ each node $i$ receives a message from its past and future copies, denoted by $\psi^{i(t-1) \to i(t)}$ and $\psi^{i(t+1) \to i(t)}$.


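The update equation for the spatial messages is

\[ \psi^{i \to j}_k(t) \propto \Big(\sum_{\ell} \pi_{\ell k}\,\psi^{i(t-1) \to i(t)}_{\ell}\Big) \Big(\sum_{\ell} \pi_{k\ell}\,\psi^{i(t+1) \to i(t)}_{\ell}\Big) \prod_{\substack{m :\, A^t_{im} = 1 \\ m \neq j}} \sum_{\ell} p_{k\ell}\,\psi^{m \to i}_{\ell}(t), \]

where the proportionality hides a factor imposing the normalisation condition $\sum_k \psi^{i \to j}_k(t) = 1$. The update equation omits the non-edges, as in sparse networks they can be approximated as a global interaction. Moreover, the update equation for $\psi^{i \to j}_k(t)$ does not involve the message that $j$ sends to $i$, to avoid any "echo chamber" effect, where information would be amplified between $i$ and $j$ in a noisy fashion (for more details see Moore, 2017). In a similar fashion, the update equation for the temporal messages is given by

\[ \psi^{i(t) \to i(t+1)}_k \propto \Big(\sum_{\ell} \pi_{\ell k}\,\psi^{i(t-1) \to i(t)}_{\ell}\Big) \prod_{j :\, A^t_{ij} = 1} \sum_{\ell} p_{k\ell}\,\psi^{j \to i}_{\ell}(t), \]

and a similar expression also holds for $\psi^{i(t) \to i(t-1)}_k$.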

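Belief propagation consists of initialising the messages randomly and then repeatedly updating them with the update equations. This is typically done asynchronously, by first choosing a node $i$ and a time $t$ uniformly at random and updating $\psi^{i \to j}_k(t)$ for all $j$ and $k$, as well as $\psi^{i(t) \to i(t+1)}_k$ and $\psi^{i(t) \to i(t-1)}_k$. When convergence occurs, we compute the marginal of each vertex using

\[ \psi^{i}_k(t) \propto \Big(\sum_{\ell} \pi_{\ell k}\,\psi^{i(t-1) \to i(t)}_{\ell}\Big) \Big(\sum_{\ell} \pi_{k\ell}\,\psi^{i(t+1) \to i(t)}_{\ell}\Big) \prod_{j :\, A^t_{ij} = 1} \sum_{\ell} p_{k\ell}\,\psi^{j \to i}_{\ell}(t). \]

We finish this section by noticing that when $\pi = \eta I_K + \frac{1 - \eta}{K} 1_K 1_K^T$, we have

\[ \sum_{\ell} \pi_{\ell k}\,\psi^{i(t-1) \to i(t)}_{\ell} = \eta\,\psi^{i(t-1) \to i(t)}_{k} + \frac{1 - \eta}{K}, \qquad \sum_{\ell} \pi_{k\ell}\,\psi^{i(t+1) \to i(t)}_{\ell} = \eta\,\psi^{i(t+1) \to i(t)}_{k} + \frac{1 - \eta}{K}, \]

which further simplifies the update equations. Furthermore, in a homogeneous model ($p_{k\ell} = p_{\mathrm{in}}$ if $k = \ell$ and $p_{\mathrm{out}}$ otherwise), we have

\[ \sum_{\ell} p_{k\ell}\,\psi^{j \to i}_{\ell}(t) = (p_{\mathrm{in}} - p_{\mathrm{out}})\,\psi^{j \to i}_{k}(t) + p_{\mathrm{out}}. \]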


6.3.3 Online Inference as a Semi-supervised Problem


The lagging problem 


In a framework where community memberships vary with time, clustering by directly applying the time-aggregated spectral methods derived in Section 6.2.3 would fail. Indeed, time-varying community memberships contaminate the information given by past interactions. For example, if node $i$ changes its community assignment at time $t_1$, then one should not use the interactions of node $i$ during the first $t_1$ snapshots to find its community membership at time $t > t_1$. This lagging problem especially complicates the situation when the layers are temporally correlated. To avoid this issue, we propose an online recovery of the node labels. More specifically:


• at time $t = 1$, we use a static community detection algorithm to output $\hat z_{\cdot 1} = (\hat z_{11}, \dots, \hat z_{n1})$, a prediction of the initial node labels $z_{\cdot 1} = (z_{11}, \dots, z_{n1})$ from the observation of the first snapshot $A^1$;

• at time $t > 1$, we use the observation of the first $t$ snapshots $A^1, \dots, A^t$ as well as the previous predictions $\hat z_{\cdot 1}, \dots, \hat z_{\cdot t-1}$. This is treated as a semi-supervised learning problem, where the prediction $\hat z_{\cdot t-1}$ made at the previous time step is seen as a noisy oracle for the true node labelling $z_{\cdot t}$ at time $t$.


From the Markov structure, the prediction at time $t > 1$ reduces to predicting $z_{\cdot t}$ using only the networks at times $t - 1$ and $t$ and the previous prediction $\hat z_{\cdot t-1}$. This can be interpreted as a noisy semi-supervised problem with oracle (see Section 5.4), where the previous prediction $\hat z_{\cdot t-1}$ plays the role of the oracle information for the node labels at time $t$. This oracle is noisy, as it bears two kinds of potential mistakes. Firstly, $\hat z_{\cdot t-1}$ is not necessarily exactly equal to the true community labelling $z_{\cdot t-1}$. Secondly, since the node labels vary through time, $z_{\cdot t-1}$ does not precisely correspond to $z_{\cdot t}$.


6.3.4 Degree-corrected Temporal SBM with Markov 
Community Memberships 


In addition to the Markov community structure described in (6.20), we will assume for simplicity that the initial labels and the transitions are uniform, that is

$$\pi = \frac{1}{K} 1_K \quad \text{and} \quad T = \eta I_K + \frac{1-\eta}{K} 1_K 1_K^T.$$

In other words, a node keeps its label with probability $\eta \in [0,1]$, and chooses a label uniformly at random with probability $1 - \eta$.
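
As an illustration of this uniform label dynamics, the following Python sketch builds the transition matrix and simulates label trajectories. It assumes the reading $T = \eta I_K + \frac{1-\eta}{K} 1_K 1_K^T$ of the garbled display above; the function names and the values $n = 300$, $K = 2$, $\eta = 0.85$ (borrowed from the figures of this section) are illustrative choices, not part of the model definition.

```python
import numpy as np


def label_transition_matrix(K, eta):
    """Assumed form T = eta * I_K + (1 - eta) / K * 1 1^T."""
    return eta * np.eye(K) + (1.0 - eta) / K * np.ones((K, K))


def simulate_labels(n, K, eta, n_steps, rng):
    """Return an (n, n_steps) array of labels in {0, ..., K-1}."""
    z = np.empty((n, n_steps), dtype=int)
    z[:, 0] = rng.integers(K, size=n)       # uniform initial labels
    for t in range(1, n_steps):
        keep = rng.random(n) < eta          # keep the label with probability eta
        fresh = rng.integers(K, size=n)     # otherwise redraw uniformly (may repeat)
        z[:, t] = np.where(keep, z[:, t - 1], fresh)
    return z


rng = np.random.default_rng(0)
T = label_transition_matrix(K=2, eta=0.85)
z = simulate_labels(n=300, K=2, eta=0.85, n_steps=70, rng=rng)
```

Note that a redrawn label may coincide with the old one, so the probability of actually staying in the same community is $\eta + (1-\eta)/K$, which is the diagonal entry of $T$.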

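We then assume that the pair interaction between two nodes $i$ and $j$ is a Markov process depending only on the community labelling and on some degree correction parameters $\theta = (\theta_1, \dots, \theta_N)$. In particular,

$$\mathbb{P}(A^{1:T} \mid z, \theta) = \prod_{1 \le i < j \le N} \mathbb{P}\big(A^1_{ij} \mid z_{i1}, z_{j1}, \theta_i, \theta_j\big) \prod_{t=2}^{T} \mathbb{P}\big(A^t_{ij} \mid A^{t-1}_{ij}, z_{it}, z_{jt}, \theta_i, \theta_j\big).$$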


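We further consider a homogeneous model in which the initial distribution is given by

$$\mathbb{P}\big(A^1_{ij} \mid z_{i1}, z_{j1}, \theta_i, \theta_j\big) = \begin{cases} \mu^{\theta_i\theta_j}(A^1_{ij}), & \text{if } z_{i1} = z_{j1}, \\ \nu^{\theta_i\theta_j}(A^1_{ij}), & \text{otherwise,} \end{cases}$$

and the transition probabilities are given by

$$\mathbb{P}\big(A^t_{ij} = b \mid A^{t-1}_{ij} = a, z_{it}, z_{jt}, \theta_i, \theta_j\big) = \begin{cases} P^{\theta_i\theta_j}_{ab}, & \text{if } z_{it} = z_{jt}, \\ Q^{\theta_i\theta_j}_{ab}, & \text{otherwise.} \end{cases}$$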


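Similarly to Section 6.2.3, the degree-corrected initial distributions are defined by

$$\mu^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j\mu_1 \\ \theta_i\theta_j\mu_1 \end{pmatrix}, \qquad \nu^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j\nu_1 \\ \theta_i\theta_j\nu_1 \end{pmatrix},$$

and the transition probability matrices are given by

$$P^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j P_{01} & \theta_i\theta_j P_{01} \\ 1 - P_{11} & P_{11} \end{pmatrix}, \qquad Q^{\theta_i\theta_j} = \begin{pmatrix} 1 - \theta_i\theta_j Q_{01} & \theta_i\theta_j Q_{01} \\ 1 - Q_{11} & Q_{11} \end{pmatrix},$$

with the assumption $\min_{i,j}\{\theta_i\theta_j\delta\} < 1$, where $\delta = \max\{\mu_1, \nu_1, P_{01}, Q_{01}\}$. We normalise the degree correction parameters so that for all $k$ it holds that $\sum_i 1(z_i = k)\theta_i = \sum_i 1(z_i = k)$. Finally, we suppose that the transition probabilities and the degree-correction parameters do not vary with time, to avoid any parameter identifiability issues (Matias and Miele, 2017).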


Online Maximum A Posteriori estimator 


The following proposition gives the expression of the MAP estimator for the pre- 
sented online learning problem. 




Proposition 6.3. Let $s \in [K]^n$ be a noisy oracle on the node labels at time $t$, which is supposed to be independent of the observed interactions $A$. Define the rate of mistake of $s$ as $p = \mathbb{P}(s_i \neq z_{it})$ and assume this rate is the same for all nodes. A Maximum A Posteriori estimator for the online learning problem described above is defined by

$$\hat z_t = \arg\max_{z \in [K]^n} \mathbb{P}\big(z \mid A^{t-1}, A^t, s\big)$$

and is any labelling $z \in [K]^n$ that maximises


0:0; = 0:0; = = 0:0; 4 
` ti (4; -A i) me (4; 1A; ZA) HOVATTA 


R= 


Qn n 
0 
— log i | +22 1 (z; = si), 
Poo i=l 
0,0; 0,0; 
P 1 F P 1 F _ 
d — log 2 and à = log 2. 
iJ ij P 
Pa Poo 


0;0; 
where £ = log 
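Proof. By Bayes' rule,

$$\mathbb{P}\big(z \mid A^t, A^{t-1}, s, \theta\big) \propto \mathbb{P}\big(A^t \mid A^{t-1}, z, s, \theta\big)\, \mathbb{P}\big(z \mid A^{t-1}, s, \theta\big),$$

where the proportionality symbol hides a term $\mathbb{P}(A^t \mid A^{t-1}, s, \theta)$ independent of $z$. Since $\mathbb{P}(A^t \mid A^{t-1}, z, s, \theta) = \mathbb{P}(A^t \mid A^{t-1}, z, \theta)$, then proceeding similarly to the proof of Proposition 6.2, the log-likelihood term $\log \mathbb{P}(A^t \mid A^{t-1}, z, \theta)$ can be rewritten as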


0:0; ( at a er 6:0; f ,t-1 t—1 yt 
; G (45; — A545) + tio (Ay! — 4545) 


0:0; 

0:0; 41-1 yt Qoo 
+ €) Ay A; — log ag; È 
00 


The oracle information is equal to 


T Pili) 
P = P (z; 
(ls) II Pe l ® 


; 1\” 
= (1 — p)ltela: z= liel: zA f 
a-p) p ra 


B p l{e[a]: zi#si} 4 y 17” 
~ \l-p p K 


where we used the uniformity of the node labels. 




Continuous relaxation of the MAP

For simplicity of the derivations to come, in this section we restrict the study to $K = 2$.

Denote by $A^t_{\rm pers} = A^{t-1} \odot A^t$ the adjacency matrix corresponding to persistent edges, by $A^t_{\rm new} = A^t - A^t_{\rm pers}$ the adjacency matrix corresponding to freshly formed edges, and by $A^t_{\rm old} = A^{t-1} - A^t_{\rm pers}$ the adjacency matrix corresponding to disappearing edges between time $t - 1$ and $t$. Then, using the Taylor expansion as in Section 6.2.3, we can approximate the MAP estimator by

$$\arg\min_{z \in \{-1,1\}^n} \; -z^T \Big( W^t - \tau \frac{d d^T}{2m} \Big) z + \lambda (s - z)^T (s - z), \qquad (6.21)$$

where $W^t = \alpha_{01} A^t_{\rm new} + \alpha_{10} A^t_{\rm old} + \alpha_{11} A^t_{\rm pers}$ with $\alpha_{ab} = \log \frac{P_{ab}}{Q_{ab}}$, $\tau$ is a resolution parameter, $d_i = \sum_j W_{ij}$ and $m = \frac{1}{2} \sum_i d_i$.

This minimisation problem is analogous to the one studied in Section 5.4 for noisy semi-supervised clustering in the DC-SBM. We can also propose the following continuous relaxation

$$\hat x = \arg\min_{x \in \mathbb{R}^n,\; x^T D x = 2m} \; -x^T M x + \lambda (s - x)^T (s - x),$$

where $D = \mathrm{diag}(d_1, \dots, d_n)$ and $M = W - \tau \frac{d d^T}{2m}$. The solution of this relaxation is determined by mimicking the reasoning of Section 5.4.2. In particular, by denoting the eigendecomposition of $D^{-1/2}(-M + \lambda I_n) D^{-1/2}$ by

$$D^{-1/2} (-M + \lambda I_n) D^{-1/2} = Q \Delta Q^T$$

with $\Delta = \mathrm{diag}(\delta_1, \dots, \delta_n)$ and $Q Q^T = I_n$, and letting $b = \lambda Q^T D^{-1/2} s$, we obtain that $\hat x$ verifies

$$(-M + \lambda I_n - \gamma_* D)\, \hat x = \lambda s, \qquad (6.22)$$

where $\gamma_*$ is the smallest solution of the explicit secular equation (Gander et al., 1989)

$$\sum_{i=1}^{n} \Big( \frac{b_i}{\delta_i - \gamma} \Big)^2 - 2m = 0. \qquad (6.23)$$

This leads to Algorithm 19.
Algorithm 19: Online clustering of time-varying communities.

Input: Observed graph sequence $A^{1:T} = (A^1, \dots, A^T)$; number of communities $K$; static graph clustering algorithm algo; parameters $\alpha_{01}, \alpha_{10}, \alpha_{11}$ and $\lambda$; resolution parameter $\tau$.
Output: Node labelling $\hat z = (\hat z_{it})$.
Initialize: Compute $\hat z_{\cdot 1} \leftarrow$ algo($A^1$).
for $t = 2, \dots, T$ do
    Compute $W = \alpha_{01} A^t_{\rm new} + \alpha_{10} A^t_{\rm old} + \alpha_{11} A^t_{\rm pers}$.
    Compute $M = W - \tau \frac{d d^T}{2m}$ where $d_i = \sum_j W_{ij}$ and $m = \frac{1}{2} \sum_i d_i$.
    Let $\gamma_*$ be the smallest solution of Equation (6.23).
    Compute $\hat x$ as the solution of Equation (6.22).
    Let $\hat z_{\cdot t} = \mathrm{sign}(\hat x)$.
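
One iteration of the loop in Algorithm 19 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it takes Equation (6.22) to read $(-M + \lambda I_n - \gamma_* D)\hat x = \lambda s$ with $b = \lambda Q^T D^{-1/2} s$, it assumes every node has a positive weighted degree, and it locates the smallest root of the secular equation (6.23) by bisection below the smallest eigenvalue. The names `online_step` and `sample_sbm`, the toy graph and the default parameter values are illustrative.

```python
import numpy as np


def online_step(A_prev, A_curr, s, alpha01=1.0, alpha10=0.0, alpha11=2.0,
                lam=0.5, tau=1.0):
    """One loop iteration: returns (labels in {-1, +1}, relaxed solution x)."""
    A_pers = A_prev * A_curr                 # persistent edges
    A_new = A_curr - A_pers                  # freshly formed edges
    A_old = A_prev - A_pers                  # disappearing edges
    W = alpha01 * A_new + alpha10 * A_old + alpha11 * A_pers
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("this sketch assumes positive weighted degrees")
    two_m = d.sum()
    M = W - tau * np.outer(d, d) / two_m
    n = len(s)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    # Eigendecomposition of D^{-1/2} (-M + lam I) D^{-1/2}, eigenvalues ascending.
    S = inv_sqrt_d[:, None] * (lam * np.eye(n) - M) * inv_sqrt_d[None, :]
    delta, Q = np.linalg.eigh(S)
    b = lam * (Q.T @ (inv_sqrt_d * s))

    def secular(gamma):
        return np.sum((b / (delta - gamma)) ** 2) - two_m

    # secular() is increasing on (-inf, delta[0]); bracket its root there.
    width = 1.0
    while secular(delta[0] - width) > 0:
        width *= 2.0
    lo, hi = delta[0] - width, delta[0]
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:           # interval exhausted in floating point
            break
        if secular(mid) > 0:
            hi = mid
        else:
            lo = mid
    gamma = lo
    x = inv_sqrt_d * (Q @ (b / (delta - gamma)))
    return np.where(x >= 0, 1, -1), x


def sample_sbm(labels, p_in, p_out, rng):
    """Symmetric 0/1 adjacency matrix of a two-block SBM (toy helper)."""
    same = labels[:, None] == labels[None, :]
    upper = np.triu(rng.random((len(labels),) * 2) < np.where(same, p_in, p_out), k=1)
    return (upper | upper.T).astype(float)


rng = np.random.default_rng(0)
labels = np.repeat([1.0, -1.0], 30)
A_prev = sample_sbm(labels, 0.5, 0.05, rng)
A_curr = sample_sbm(labels, 0.5, 0.05, rng)
s = labels.copy()
s[:5] *= -1                                  # noisy oracle: five labels flipped
z_hat, x_hat = online_step(A_prev, A_curr, s)
```

In a full run, `s` would be the prediction returned at the previous time step, and the first prediction would come from a static clustering algorithm applied to the first snapshot.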


Numerical experiments 


We compare in Figure 6.6 the averaged accuracy obtained by Algorithm 19 with Algorithm 17 (spectral clustering with persistent edges) and an algorithm performing spectral clustering on each snapshot individually. When $\eta = 1$ (i.e., static community structure), Algorithm 17 is extremely efficient, as expected: since it takes into account all previous snapshots, it outperforms Algorithm 19. When $\eta \neq 1$, however, the lagging problem arises, and Algorithm 17 ends up with a very poor accuracy after a few snapshots, whereas Algorithm 19 keeps a very high accuracy over all snapshots.


[Figure 6.6: accuracy against the number of time steps (0 to 70) for weighted SC, online-ssl and individual SC. Panels: (a) $\eta = 1$; (b) $\eta = 0.85$.]

Figure 6.6. Accuracy of Algorithm 19 (online-ssl) with $\alpha_{01} = 1$, $\alpha_{10} = 0$ and $\alpha_{11} = 2$, on time-varying Markov Block Models with 300 nodes and $K = 2$ blocks (with uniform prior), and a stationary Markov edge evolution $\mu_1 = 0.05$, $\nu_1 = 0.02$, $P_{11} = 0.7$ and $Q_{11} = 0.3$. The results are averaged over 25 synthetic graphs, and error bars show the standard error. We compare with Algorithm 17 (weighted SC with $\alpha = 1$, $\beta = 2$) and an algorithm performing Spectral Clustering on each snapshot individually.
Figure 6.7. Accuracy of Algorithm 19 with $\alpha_{01} = 1$, $\alpha_{10} = 0$ and $\alpha_{11} = 2$ and various values of $\lambda$. Simulations are performed on time-varying Markov Block Models with $n = 300$, $K = 2$, $\mu_1 = 0.05$, $\nu_1 = 0.02$, $P_{11} = 0.7$, $Q_{11} = 0.3$ and $\eta = 0.9$. The results are averaged over 25 synthetic graphs, and error bars show the standard error.


In Figure 6.6, we choose $\lambda_t$ to be constant and equal to 0.5, while Figure 6.7 explores other possible values. We observe that when $\lambda_t$ is equal to a constant in the interval [0.1, 1], Algorithm 19 performs similarly. On the other hand, when $\lambda$ becomes too large, Algorithm 19 gives too much importance to the oracle, and the accuracy becomes worse. In practice, the choice of the parameters $\lambda_t$ could be optimised from the data, e.g., based on $\eta$ or on the transition matrices $P$ and $Q$. Moreover, it would be intuitive to increase $\lambda_t$ with $t$, as the confidence in the oracle is higher when more temporal data is available. We leave this as a topic for future work.


Further Notes 


We refer to Decelle et al., 2011; Moore, 2017 for an extended description of the belief propagation techniques. Belief propagation was introduced for dynamic networks in Ghasemian et al., 2016, and the extension for models incorporating link persistence is considered in Ghasemian, 2019. Similarly, Barucca et al., 2018 studied a model with a Markov evolution of the community memberships and link persistence. While their interaction setting is restrictive, they showed that edge persistence increases the difficulty of community recovery.

Finally, some models also allow for an evolution of the interaction parameters 
over time (Xu and Hero, 2014; Bhattacharyya and Chatterjee, 2020). Nonethe- 
less, it is important to note that identifiability issues often occur when both the 
memberships and the interaction kernels vary over time (Matias and Miele, 2017). 
Chapter 7


Sampling in Networks 


Many networks, including online and offline social networks, exist for which it 
is impossible to obtain a complete picture of the network. This leaves researchers 
with the need to develop sampling techniques for characterizing and studying large 
networks. 

The general problem of sampling in a network can be formalised as follows. Let $G = (V, E)$ be an undirected network with $n = |V|$ nodes and $m = |E|$ links. We would like to design an efficient estimator of the average of a network function

$$\bar f = \frac{1}{n} \sum_{v \in V} f(v). \qquad (7.1)$$

Despite the simplicity of the problem formulation, this can be used to describe many real-world statistical questions. Let us give just a few examples.

• How young is a social network? Take as $f(v)$ the age of node $v$;

• How many friends does a social network member have on average? Take as $f(v)$ the degree (the number of friends) of node $v$;

• What is the proportion of a certain sub-population in a network? Take $f(v) = 1$ if $v$ belongs to that sub-population and $f(v) = 0$ otherwise.
7.1 Overview of Sampling Methods


7.1.1 Independent Uniform Sampling


Clearly, the simplest unbiased estimator is the independent uniform sampling estimator. That is, obtain a set of samples $v_{i_1}, \dots, v_{i_k}$, sampling each node independently with some probability $p$. Then,

$$\hat f_{\rm U} = \frac{1}{k} \sum_{j=1}^{k} f(v_{i_j}). \qquad (7.2)$$

Strictly speaking, here we consider sampling with replacement. However, in large networks hitting the same node twice occurs with a very small probability. This approach is widely attempted in practice but has at least the following two drawbacks: (i) in most cases, it is not at all easy to perform uniform sampling. Take for example a telephone survey: if mostly landline numbers are used, this can introduce an age-based bias. (ii) If one is interested in the study of a very small sub-population, it can be extremely difficult to collect enough samples from that sub-population. The latter concern has motivated the development of a number of methods based on the chain-referral approach, see e.g., Goodman, 1961.


7.1.2 Snowball Sampling


Snowball sampling is the first “naive” chain-referral approach. In a chain-referral approach, the sampling process starts from an initial subject (node), who provides one or several contacts of his/her friends (neighbours). Then, each new contact is asked to complete a questionnaire and, once it is completed, to provide his/her own contact list. This process continues until a sufficient number of samples is collected. The naive snowball estimator is very similar in its form to the estimator (7.2), namely

$$\hat f_{\rm S} = \frac{1}{k} \sum_{j=1}^{k} f(v_{i_j}), \qquad (7.3)$$

where $v_{i_1}, \dots, v_{i_k}$ are the contacted subjects who gave answers.

Note that if at each stage we query just one neighbour from a contact list, this 
will correspond to a random walk on a social network. 

One important problem with the naive snowball sampling is that the nodes with 
many neighbours are over-sampled (Erickson, 1979) because the random walk is 
more likely to come to such nodes. 
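
The over-sampling effect can be seen on a toy example. The sketch below is only an illustration under simple assumptions: the graph is a star with one centre and nine leaves, the function to average is the degree, and the helper names `random_walk` and `sample_mean` are hypothetical. The plain sample mean is applied both to a random walk and to uniform samples.

```python
import random


def random_walk(adj, start, k, rng):
    """Visit k nodes by repeatedly moving to a uniformly chosen neighbour."""
    v, visited = start, []
    for _ in range(k):
        v = rng.choice(adj[v])
        visited.append(v)
    return visited


def sample_mean(f, nodes):
    """Plain sample mean of f over the sampled nodes."""
    return sum(f(v) for v in nodes) / len(nodes)


# Toy graph: a star with centre 0 and leaves 1..9, so degrees are very uneven.
adj = {0: list(range(1, 10))}
adj.update({leaf: [0] for leaf in range(1, 10)})


def degree(v):
    return len(adj[v])


rng = random.Random(0)
walk = random_walk(adj, start=0, k=1000, rng=rng)
uniform = [rng.randrange(10) for _ in range(1000)]

true_avg = sum(degree(v) for v in adj) / len(adj)   # (9 + 9 * 1) / 10
naive_rw = sample_mean(degree, walk)                # walk alternates centre / leaf
naive_uniform = sample_mean(degree, uniform)
```

On the star the walk spends half of its time at the centre, so the naive random-walk average of the degree is 5, far above the true average degree of 1.8, whereas the uniform sample mean stays close to 1.8.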
7.1.3 Metropolis-Hastings Sampling


One natural way to mitigate the over-sampling of large degree nodes is to use the classical Markov chain techniques of Metropolis et al., 1953 and Hastings, 1970b.

To avoid the bias with respect to node degrees, we would like the target distribution of the random walk to be uniform, i.e., $\pi(v) = 1/n$. Then, according to the Metropolis-Hastings (MH) approach, we should change the probability of neighbour node selection to

$$p_{vu} = \frac{1}{d(v)} \min\Big\{1, \frac{\pi(u)/d(u)}{\pi(v)/d(v)}\Big\} = \frac{1}{d(v)} \min\Big\{1, \frac{d(v)}{d(u)}\Big\} = \frac{1}{\max\{d(v), d(u)\}}$$

if $(v,u) \in E$ and $v \neq u$, whereas $p_{vu} = 0$ if $(v,u) \notin E$ and $v \neq u$, and finally $p_{vv} = 1 - \sum_{w \neq v} p_{vw}$ if $u = v$. To summarise, we have

$$p_{vu} = \begin{cases} 0, & \text{if } (v,u) \notin E \text{ and } v \neq u, \\ \dfrac{1}{\max\{d(v), d(u)\}}, & \text{if } (v,u) \in E \text{ and } v \neq u, \\ 1 - \sum_{w : (v,w) \in E} \dfrac{1}{\max\{d(v), d(w)\}}, & \text{if } u = v. \end{cases} \qquad (7.4)$$

Using the central limit theorem for Markov chains (see e.g., Brémaud, 1999), Avrachenkov et al., 2018b established the asymptotic consistency for the estimator (7.3), where the samples $v_{i_1}, \dots, v_{i_k}$ are generated according to (7.4). Specifically, we can state the following theorem.


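Theorem 7.1 (Central Limit Theorem for MH-estimator). For the MH-estimator, it holds that

$$\sqrt{k}\, \big( \hat f_{\rm MH} - \bar f \big) \xrightarrow{\;D\;} \mathcal{N}(0, \sigma^2_{\rm MH}), \quad \text{as } k \to \infty,$$

where $\sigma^2_{\rm MH} = \frac{2}{n} f^T Z f - \frac{1}{n} f^T f - \big( \frac{1}{n} f^T 1 \big)^2$, $f^T = (f(1), \dots, f(n))$ and where $Z = \big( I - P + \frac{1}{n} 1 1^T \big)^{-1}$ is the fundamental matrix.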


In the context of online social networks, the use of MH-estimator was first pro- 
posed by Gjoka et al., 2010. 
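
A Metropolis-Hastings walk with uniform target can be sketched as below. The sketch assumes the usual proposal-and-accept form of the chain, in which a uniformly chosen neighbour $u$ of the current node $v$ is accepted with probability $\min\{1, d(v)/d(u)\}$, giving the transition probability $1/\max\{d(v), d(u)\}$; the star graph and the name `mh_walk` are illustrative.

```python
import random


def mh_walk(adj, start, k, rng):
    """Metropolis-Hastings walk whose stationary distribution is uniform."""
    v, visited = start, []
    for _ in range(k):
        u = rng.choice(adj[v])                         # propose a uniform neighbour
        if rng.random() < min(1.0, len(adj[v]) / len(adj[u])):
            v = u                                      # accept the move
        visited.append(v)                              # on rejection the walk stays at v
    return visited


# Same toy star as before: centre 0 and leaves 1..9, true average degree 1.8.
adj = {0: list(range(1, 10))}
adj.update({leaf: [0] for leaf in range(1, 10)})

rng = random.Random(1)
walk = mh_walk(adj, start=0, k=20000, rng=rng)
mh_estimate = sum(len(adj[v]) for v in walk) / len(walk)
```

The degree bias disappears, but the walk repeats the same leaf many times in a row before a move to the centre is accepted, which is the resampling inefficiency of this approach.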


7.1.4 Respondent-driven Sampling


One important problem with the MH-estimator is that it resamples many nodes and thus it is not very efficient. This problem can be corrected by the Respondent-Driven Sampling (RDS) proposed in a series of works by Heckathorn, 1997; Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008. In RDS the underlying sampling process is carried out with a standard random walk but the estimator is modified as follows:

$$\hat f_{\rm RDS} = \frac{2m}{nk} \sum_{j=1}^{k} \frac{f(v_{i_j})}{d(v_{i_j})}, \qquad (7.5)$$

where $d(v_{i_j})$ is the degree (the number of neighbours) of node $v_{i_j}$ and $m$ is the number of links in the network. Of course, the value of $m$ may not be available or may be difficult to estimate. This is mitigated in the following modification of the RDS-estimator

$$\hat f_{\rm RDS'} = \frac{\sum_{j=1}^{k} f(v_{i_j}) / d(v_{i_j})}{\sum_{j=1}^{k} 1 / d(v_{i_j})}. \qquad (7.6)$$

The RDS-estimators (7.5) and (7.6) are asymptotically consistent and the corresponding CLTs can be found in Avrachenkov et al., 2018b.
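
The self-normalised estimator (7.6) is straightforward to compute from a random-walk sample. The sketch below assumes (7.6) is the ratio of $\sum_j f(v_{i_j})/d(v_{i_j})$ to $\sum_j 1/d(v_{i_j})$; the toy star graph and the helper names are illustrative.

```python
import random


def random_walk(adj, start, k, rng):
    """Standard random walk: move to a uniformly chosen neighbour k times."""
    v, visited = start, []
    for _ in range(k):
        v = rng.choice(adj[v])
        visited.append(v)
    return visited


def rds_estimate(f, degree, walk):
    """Self-normalised RDS estimator: each sample is weighted by 1 / degree."""
    num = sum(f(v) / degree(v) for v in walk)
    den = sum(1.0 / degree(v) for v in walk)
    return num / den


# Toy star: centre 0 and leaves 1..9, true average degree 1.8.
adj = {0: list(range(1, 10))}
adj.update({leaf: [0] for leaf in range(1, 10)})


def degree(v):
    return len(adj[v])


walk = random_walk(adj, start=0, k=1000, rng=random.Random(0))
rds = rds_estimate(degree, degree, walk)   # estimate of the average degree
```

On this graph the walk alternates between the centre and a leaf, and the inverse-degree weights exactly cancel the over-representation of the centre, so the estimate equals the true average degree 1.8 up to rounding.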


7.1.5 Respondent-driven Sampling with Uniform Jumps


The RDS-estimator still has the following problem: the random walk can be trapped in a sub-network with few connections to the other parts of the network. To overcome this problem, Avrachenkov et al., 2010 suggested combining the random walk with uniform jumps. Specifically, let us modify the network adjacency matrix $A$ in the following way:

$$\tilde A = A + \frac{\alpha}{n} 1 1^T.$$

Namely, we add an artificial link with weight $\alpha/n$ between any two nodes, so that every node receives a total artificial weight $\alpha$. One interpretation of this modification is that we combine the random walk based sampling with the uniform sampling.

Typically, the weight $\alpha$ is small, as one sample of the uniform sampling is more costly than one sample of the random walk based sampling. For example, in an online social network (OSN), where users are associated with unique numeric IDs, uniform node sampling is performed by querying randomly generated IDs. In practice, however, these samples are expensive (resource-wise) operations as the ID space in an OSN, such as Facebook and Myspace, is large and sparse. For instance, in Myspace only 10% of the IDs belong to valid users (Gauvin et al., 2010), i.e., only one in every ten queries successfully finds a valid Myspace account. In this example, a natural choice for the parameter $\alpha$ is 1/10.

Note that the random walk on the weighted graph defined by $\tilde A$ is still a random walk on an undirected graph and hence its stationary distribution is proportional to the weighted degree, i.e.,

$$\tilde \pi(v) = \frac{d(v) + \alpha}{2m + \alpha n} = \frac{1}{n}\, \frac{d(v) + \alpha}{\bar d + \alpha},$$

where $\bar d$ is the average degree of the network. Thus, we can modify the RDS-estimator as follows:

$$\hat f_{\rm RDS\text{-}J} = \frac{2m + \alpha n}{nk} \sum_{j=1}^{k} \frac{f(v_{i_j})}{d(v_{i_j}) + \alpha} = \frac{\bar d + \alpha}{k} \sum_{j=1}^{k} \frac{f(v_{i_j})}{d(v_{i_j}) + \alpha}. \qquad (7.7)$$

If the average degree and the total number of nodes are unknown, one can use a modification, similar to (7.6), i.e.,

$$\hat f_{\rm RDS\text{-}J'} = \frac{\sum_{j=1}^{k} f(v_{i_j}) / (d(v_{i_j}) + \alpha)}{\sum_{j=1}^{k} 1 / (d(v_{i_j}) + \alpha)}. \qquad (7.8)$$

One more natural candidate for the combination of a random walk with uniform restart is the modification in PageRank style. That is, one can change the transition probability matrix of the random walk as follows:

$$\tilde P = (1 - \epsilon) P + \epsilon\, \frac{1}{n} 1 1^T. \qquad (7.9)$$

One big disadvantage of this approach is that even in the case of an undirected graph the stationary distribution of $\tilde P$, PageRank, does not have an explicit expression which could be used in (7.7). Of course, one can then use the Metropolis-Hastings modification of the transition probabilities. However, as we noted before, such a modification leads to frequent resampling.
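
The random walk with uniform jumps and the self-normalised estimator (7.8) can be sketched as follows. The sketch assumes that the modified adjacency matrix adds weight $\alpha/n$ between every pair of nodes, so that from a node of degree $d$ the walk jumps to a uniformly chosen node with probability $\alpha/(d + \alpha)$ and otherwise moves to a uniformly chosen neighbour; the disconnected toy graph, the value $\alpha = 0.5$ and the function names are illustrative.

```python
import random


def walk_with_jumps(adj, nodes, start, k, alpha, rng):
    """Random walk on the graph with artificial links of total weight alpha per node."""
    v, visited = start, []
    for _ in range(k):
        d = len(adj[v])
        if rng.random() < alpha / (d + alpha):
            v = rng.choice(nodes)        # uniform jump over all nodes
        else:
            v = rng.choice(adj[v])       # ordinary random-walk step
        visited.append(v)
    return visited


def rds_jump_estimate(f, adj, walk, alpha):
    """Self-normalised estimator with weights 1 / (d(v) + alpha)."""
    num = sum(f(v) / (len(adj[v]) + alpha) for v in walk)
    den = sum(1.0 / (len(adj[v]) + alpha) for v in walk)
    return num / den


# Disconnected toy graph: a triangle {0, 1, 2} and a single edge {3, 4}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
nodes = list(adj)

rng = random.Random(0)
walk = walk_with_jumps(adj, nodes, start=0, k=20000, alpha=0.5, rng=rng)
estimate = rds_jump_estimate(lambda v: len(adj[v]), adj, walk, alpha=0.5)
```

A plain random walk started in the triangle would never reach the edge $\{3, 4\}$; with jumps both components are visited and the estimate approaches the true average degree $8/5 = 1.6$.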

Interestingly, the random walk with uniform jumps defined by the modified adjacency matrix $\tilde A$ can be viewed as PageRank with node-dependent restart probability. To see this, we can transform the transition probability matrix $P_\alpha$ of the random walk with jumps as follows:

$$P_\alpha = (D + \alpha I)^{-1} \Big( A + \frac{\alpha}{n} 1 1^T \Big) = (D + \alpha I)^{-1} D\, D^{-1} A + (D + \alpha I)^{-1} \alpha I\, \frac{1}{n} 1 1^T,$$

which is the expression (3.8) with the restart probability matrix

$$C = (D + \alpha I)^{-1} D = \mathrm{diag}\Big( \frac{d(i)}{d(i) + \alpha} \Big)$$

and the personalization distribution $v = \frac{1}{n} 1$. Thus, $\tilde \pi(v)$ is the Occupation-Time Personalized PageRank (OT-PPR) of node $v$, defined in (3.9). In particular, the expression for $C$ means that node $i$ restarts with probability $1 - d(i)/(d(i) + \alpha) = \alpha/(d(i) + \alpha)$, so that the random walk with jumps restarts with higher probability from low degree nodes.

Finally, the expressions (3.14) and (3.15) give us a useful formula for the expected time between consecutive restarts

$$\mathbb{E}[\text{time between consecutive restarts}] = \Big( \sum_{i} \Big( 1 - \frac{d(i)}{d(i) + \alpha} \Big) \frac{d(i) + \alpha}{2m + \alpha n} \Big)^{-1} = \frac{2m + \alpha n}{n \alpha} = \frac{\bar d + \alpha}{\alpha}.$$

This formula allows us to tune the frequency of jumps by varying the parameter $\alpha$.


7.1.6 Ratio with Tours Estimator


It may be difficult to do uniform sampling even from time to time. Therefore, instead of creating artificial links between all nodes, we can create artificial links between some nodes. Intuitively, it is beneficial to create artificial links between nodes from very different parts of the network. This should significantly reduce the mixing time of the random walk. We can also consider the artificially linked nodes as one super-node. Let us denote the set of such nodes by $S$. Figure 7.1 illustrates the idea of the super-node. Note that now the graph can have multiple links and the transition probability for the random walk needs to be modified in proportion to the multiple links.

Figure 7.1. Construction of a super-node: (a) original network; (b) modified network with a super-node, $S_4 = \{c, g, k, q\}$. This figure first appeared in Avrachenkov et al., 2016c.

Once the super-node is created, we can run the random walk in tours which start and end at the super-node, and use the following Ratio with Tours estimator (RT-estimator):


DPA SS! F(v;,)/d(vi,) + 1/ds Dyes f 0 } 
DO EË 1/d(v;) + n/ds 


where $\xi_k$ is the length of the $k$-th tour, $B$ is the sampling budget, $m(B)$ is the number


(7.10) 


fO = 


of tours until the budget is exhausted, i.e.,

$$m(B) = \max\Big\{ k : \sum_{j=1}^{k} \xi_j \le B \Big\}, \qquad (7.11)$$


$d_S$ is the degree of the super-node, and


ae fv), ifvgs, 
fe) = if ifve S. 


7.2 Tour-based Estimators for Motif Counting 


Motif counting is an important task in network analysis. For instance, we need to count triangles and wedges to calculate the (global) clustering coefficient.

Cooper et al., 2016 proposed tour-based estimators for efficient estimation of network motifs. We note that their approach can be combined with the idea of a super-node. To keep explanations transparent, let us consider tours of the random walk which start and end at a single node, say $s$. Then, as before, let $\xi_j$ denote the length of the $j$-th tour. Let $\pi_s$ be the stationary probability of the random walk to be at node $s$. Then, we know that

$$\mathbb{E}_s[\xi_j] = \frac{1}{\pi_s} = \frac{2m}{d(s)}.$$

Thus, we can use the following estimator for the number of links:

$$\hat m = \frac{d(s)}{2\, m(B)} \sum_{k=1}^{m(B)} \xi_k, \qquad (7.12)$$

where $m(B)$ is defined in (7.11).
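
The tour-based estimator of the number of links can be sketched as below. The sketch assumes that (7.12) rescales the average tour length by $d(s)/2$, as suggested by $\mathbb{E}_s[\xi_j] = 2m/d(s)$, and, as a simplification, it runs a fixed number of tours instead of stopping when a budget $B$ is exhausted; the complete graph on five nodes and the function names are illustrative.

```python
import random


def tour_lengths(adj, s, n_tours, rng):
    """Lengths of random-walk tours that start at s and stop at the first return to s."""
    lengths = []
    for _ in range(n_tours):
        v, steps = rng.choice(adj[s]), 1
        while v != s:
            v = rng.choice(adj[v])
            steps += 1
        lengths.append(steps)
    return lengths


def estimate_edges(adj, s, lengths):
    """Tour-based estimate of the number of links: d(s) * (mean tour length) / 2."""
    return len(adj[s]) * sum(lengths) / (2 * len(lengths))


# Toy example: the complete graph on 5 nodes, which has m = 10 links.
adj = {v: [u for u in range(5) if u != v] for v in range(5)}

rng = random.Random(0)
lengths = tour_lengths(adj, s=0, n_tours=5000, rng=rng)
m_hat = estimate_edges(adj, s=0, lengths=lengths)
```

Here the expected tour length is $2m/d(s) = 20/4 = 5$, so the estimate concentrates around 10.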

Next, if we want to estimate the number of triangles, we consider a random walk on a weighted network, where for each link $\{v, u\}$ we assign a weight $1 + t(\{v, u\})$, with $t(\{v, u\})$ being the number of triangles containing $\{v, u\}$. The stationary distribution of the random walk on such a weighted network is given by

$$\pi_v = \frac{d(v) + \sum_{w \in N(v)} t(\{v, w\})}{2m + 6\, t(G)},$$

where $N(v)$ is the set of neighbours of $v$ and $t(G)$ is the number of triangles in the network. Thus, we can use the following estimator for the number of triangles:

$$\hat t = \max\Big\{ 0,\; \frac{d(s) + \sum_{w \in N(s)} t(\{s, w\})}{6\, m(B)} \sum_{k=1}^{m(B)} \xi_k - \frac{\hat m}{3} \Big\},$$

where the tour lengths $\xi_k$ are those of the weighted random walk and $\hat m$ is an estimate of the number of edges, e.g., given by (7.12). It is straightforward to apply this approach to counting any network motif.


7.3 Numerical Comparison of Sampling Methods 


7.3.1 Synthetic Networks


We first consider a SBM with $n = 20000$ nodes clustered in two communities of respective sizes 200 and 19800. We let $p_{11} = 0.3$, while $p_{12} = p_{22} = 0.001$. This models a small sub-population in a large social network. As the function to average, we first choose $f(v) = 1$ if node $v$ is in the smallest cluster, and $f(v) = 0$ otherwise. The results are plotted in Figure 7.2. We observe that uniform sampling provides excellent results, even using only $k = 500$ randomly chosen nodes, while the “naive” snowball sampling yields an over-estimation. This is expected since the standard random walk is biased towards large degree nodes, and in this situation the large degree nodes are located in the smallest community. On the other hand, Metropolis-Hastings sampling and RDS successfully correct this bias.

[Figure 7.2: boxplots of the estimates obtained with RW, MH, RDS and uniform sampling. Panels: (a) $k = 500$; (b) $k = 2000$.]

Figure 7.2. Different methods sampling the proportion of nodes in the smallest community of a SBM for a sampling budget of $k = 500$ and $k = 2000$. The two communities are of size 200 and 19800, and the probabilities of links are $p_{11} = 0.3$ while $p_{12} = p_{22} = 0.001$. The correct proportion is thus 0.01, and the boxplots show the results of 100 sampling trials.

[Figure 7.3: boxplots of the estimates obtained with RDS and uniform sampling. Panels: (a) $k = 300$; (b) $k = 2000$.]

Figure 7.3. The performance of RDS and uniform sampling for estimating the proportion of nodes in Group-A in a SBM for a sampling budget of $k = 300$ and $k = 2000$. The two communities are of size 500 and 49500, and the probabilities of links are $p_{11} = 0.8$ and $p_{12} = p_{22} = 0.0005$. Nodes in the largest community belong to Group-C, whereas the nodes in the smallest community are equally split into Group-A and Group-B. The correct proportion of nodes from Group-A with respect to nodes from both Group-A and Group-B is thus 0.5, and the boxplots show the results of 100 sampling experiments.

To show why uniform sampling might not always perform best, we propose the following scenario. As before, we take a SBM with a large and a small community (sizes 49,500 and 500, respectively). We split the nodes of the small community into two groups of equal size (called Group-A and Group-B). The nodes in the large community are all assigned to another group, Group-C. The goal is to recover the proportion of nodes in Group-A among the nodes in the small community. A practical motivation for this scenario could be that the small cluster represents a hard-to-reach sub-population, e.g., drug addicts. In this example the small community is further divided into heavy users and light users, and one could be interested in the proportion of heavy users among the drug users. We assume that we know 10 nodes that belong to Group-A. We merge those nodes into a super-node, and perform RDS on this modified graph, which we compare with uniform sampling. The results are shown in Figure 7.3. We observe that RDS with a super-node gives an estimate with much less variance.


7.3.2 Real-world Network: DBLP 


We will now compare different sampling methods on the DBLP data set ($n = 317{,}080$ nodes and $m = 1{,}049{,}866$ edges). In Figure 7.4, we estimate the average degree, i.e., $f(v) = d(v)$. We also estimate the proportion of nodes with degree larger than 50 by considering $f(v) = 1(d(v) > 50)$ in Figure 7.5. In both cases, we observe that sampling using Metropolis-Hastings produces larger variance than RDS or uniform sampling.
Figure 7.4. Estimation of the average degree of the DBLP data set using different methods, with budgets $k = 1000$ and $k = 10000$. The boxplots show the results of 100 sampling experiments. The correct value of the average degree is 6.6.


[Figure 7.5: boxplots of the estimates obtained with MH, RDS and uniform sampling. Panels: (a) $k = 1000$; (b) $k = 10000$.]

Figure 7.5. Estimation of the proportion of large degree nodes (defined as having a degree larger than 50) in the DBLP data set using different methods, with budgets $k = 1000$ and $k = 10000$. The boxplots show the results of 100 sampling experiments. The correct proportion is 0.01.


Further Notes 


An interesting approach proposed by Dasgupta et al., 2012, called social sampling, 
can be viewed as an intermediary between uniform node sampling and random 
walk based sampling. In this approach, once a node is sampled, the information 
about its neighbours also becomes available. Clearly, such an approach, if feasible, 
requires fewer samples than the uniform node sampling and avoids dependencies 
created by the random walk based methods. 

It can be beneficial to sample a network using multiple random walks run in 
parallel, see Ribeiro and Towsley, 2010. To improve the efficiency, the multiple 
random walks should be either dependent in a special way or independent but 
timed as continuous random walks with transition rates proportional to the node 
degree. 
As discussed by Avrachenkov et al., 2016b, in certain cases, it can be beneficial to skip some samples in chain-referral methods. Intuitively, skipping some samples reduces correlation in random-walk based methods.

Instead of network functions defined over the nodes, one can consider network 
functions defined over the links or other motifs like triangles. For details on this, 
see Avrachenkov et al., 2016c. 
Appendix A


Background Material from Probability, 
Linear Algebra and Graph Theory 


A.1 Probability 


A.1.1 Probability Toolbox

In all the following, $X$ (or $X_i$) denotes a random variable (r.v.).

Proposition A.1

• For $a, b$ constants, $\mathbb{E}(aX + b) = a\mathbb{E}(X) + b$;

• $\mathbb{E}(X_1 + \dots + X_n) = \mathbb{E}(X_1) + \dots + \mathbb{E}(X_n)$;

• Let $X$ be a r.v., $A$ be an event and $1_A(X)$ be the indicator that the event is realized by $X$. Then:

$$\mathbb{E}(1_A(X)) = \mathbb{P}(X \in A).$$

Definition A.1. The variance of a random variable $X$ is given by

$$\mathrm{Var}(X) = \mathbb{E}\big( (X - \mathbb{E}(X))^2 \big).$$
Proposition A.2. We have the following results:

• $\mathrm{Var}(X) = \mathbb{E}(X^2) - (\mathbb{E}X)^2$;

• for $a, b$ constants, $\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X)$;

• if $X_1, \dots, X_m$ are mutually independent, then $\mathrm{Var}(X_1 + \dots + X_m) = \mathrm{Var}(X_1) + \dots + \mathrm{Var}(X_m)$;

• if we do not have this independence, then $\mathrm{Var}(X_1 + \dots + X_m) = \mathrm{Var}(X_1) + \dots + \mathrm{Var}(X_m) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)$, where $\mathrm{Cov}(X_i, X_j) = \mathbb{E}(X_i X_j) - \mathbb{E}(X_i)\mathbb{E}(X_j)$.


A.1.2 Basic Probability Laws


Definition A.2. A random variable $X$ is generated by a Bernoulli law with parameter $p \in [0, 1]$, denoted $X \sim \mathrm{Ber}(p)$, if:

1. $X$ takes values in $\{0, 1\}$;
2. $\mathbb{P}(X = 1) = p$ and $\mathbb{P}(X = 0) = 1 - p$.

Example A.1. A r.v. $\mathrm{Ber}(p)$ models the result when we toss a biased coin ($p$ is the probability of winning the coin toss).

Proposition A.3. Let $X \sim \mathrm{Ber}(p)$. We have $\mathbb{E}X = p$ and $\mathrm{Var}\, X = p(1 - p)$.

Definition A.3. The binomial distribution with parameters $n$ and $p$, denoted $\mathrm{Bin}(n, p)$, is the discrete probability distribution of the number of successes in a sequence of $n$ independent Bernoulli trials with parameter $p$.

Proposition A.4. If $(X_i)_{i=1,\dots,n}$ is a sequence of $n$ i.i.d. random variables distributed according to $\mathrm{Ber}(p)$, then $\sum_i X_i \sim \mathrm{Bin}(n, p)$.

Corollary A.1. Let $X \sim \mathrm{Bin}(n, p)$. Then $\mathbb{P}(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$. Moreover, $\mathbb{E}X = np$ and $\mathrm{Var}\, X = np(1 - p)$.

Definition A.4. The geometric distribution with parameter $p$, denoted $\mathrm{Geo}(p)$, is the probability distribution of the number of Bernoulli trials (of parameter $p$) needed to get one success. In particular, if $X \sim \mathrm{Geo}(p)$, then $X \in \{1, 2, \dots\}$ and $\mathbb{P}(X = k) = (1 - p)^{k-1} p$.

Proposition A.5. Let $X \sim \mathrm{Geo}(p)$. Then $\mathbb{E}X = \frac{1}{p}$ and $\mathrm{Var}\, X = \frac{1 - p}{p^2}$.
A.1.3 Concentration of Random Variables


First moment inequalities


Proposition A.6 (Markov’s inequality). Let $X$ be a random variable with positive values, and $a > 0$. We have:

$$\mathbb{P}(X \ge a) \le \frac{\mathbb{E}X}{a}.$$

Proof. $\mathbb{E}X \ge \mathbb{E}(X 1_{X \ge a}) \ge a\, \mathbb{E}(1_{X \ge a}) = a\, \mathbb{P}(X \ge a)$.

Remark A.1. By letting $a = t\,\mathbb{E}X$, we obtain $\mathbb{P}(X \ge t\,\mathbb{E}X) \le \frac{1}{t}$. The convergence speed $\frac{1}{t}$ is rather slow, and depending on the requirements may not be strong enough.

Corollary A.2 (First moment method). Let $X$ be a positive, integer-valued random variable. We have:

$$\mathbb{P}(X \neq 0) \le \mathbb{E}(X).$$

The first moment is an upper bound on the probability that an integer random variable is not equal to zero.

Proof. Since $X$ is integer valued, we have $\mathbb{P}(X \neq 0) = \mathbb{P}(X > 0) = \mathbb{P}(X \ge 1)$, and from there we can use Markov’s inequality.

Application A.3 (Union bound). Let $A_1, \dots, A_m$ be a collection of events. Then,

$$\mathbb{P}(A_1 \cup \dots \cup A_m) \le \sum_{i=1}^{m} \mathbb{P}(A_i).$$

This can be shown by using the first moment method on $X = \sum_{i=1}^{m} 1_{A_i}$ and observing that $\{X > 0\} = A_1 \cup \dots \cup A_m$.

Remark A.2. The first moment method is generally used when we have a sequence of integer, positive r.v. $X_n$ such that $\mathbb{E}X_n \to 0$. In that case, $\mathbb{P}(X_n = 0) \to 1$, that is, $X_n \to 0$ in probability.

We could naively imagine that, if $\mathbb{E}X_n \to +\infty$, then $\mathbb{P}(X_n > 0) \to 1$. Unfortunately, this is not true, and the next example provides a counter-example.

Example A.2. Let us take $X_n$ such that $X_n = n^2$ with probability $1/n$ and $X_n = 0$ otherwise. Then, $\mathbb{E}(X_n) = n \to +\infty$, but $X_n \to 0$ in probability. Loosely speaking, this happens because the variance of $X_n$ is very large. Indeed, $\mathrm{Var}\, X_n = n^2 (n - 1)$.
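
For completeness, the quantities quoted in Example A.2 follow from a direct computation, using only the definition of $X_n$ given in the example:

```latex
\begin{align*}
\mathbb{E}(X_n) &= n^2 \cdot \tfrac{1}{n} = n, \\
\mathbb{E}(X_n^2) &= n^4 \cdot \tfrac{1}{n} = n^3, \\
\operatorname{Var}(X_n) &= \mathbb{E}(X_n^2) - \big(\mathbb{E}(X_n)\big)^2 = n^3 - n^2 = n^2 (n - 1), \\
\mathbb{P}(X_n > 0) &= \tfrac{1}{n} \longrightarrow 0 .
\end{align*}
```

The mean diverges while the probability of being non-zero vanishes, which is exactly the situation the first moment alone cannot detect.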
Second moment inequalities


Proposition A.7 (Chebyshev’s inequality). Let $X$ be a random variable, and $a > 0$.
We have:

\[ P(|X - EX| \ge a) \le \frac{\mathrm{Var}\, X}{a^2}. \]

Proof. Apply Markov’s inequality to $Y = (X - EX)^2$ with threshold $a^2$.


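Example A.3. Let $X$ be Gaussian $\mathcal{N}(0, \sigma^2)$. Then $E|X| = \sigma \sqrt{\frac{2}{\pi}}$. Then Markov’s
inequality applied to $|X|$ gives

\[ P(|X| \ge a) \le \frac{\sigma}{a} \sqrt{\frac{2}{\pi}}, \]

while Chebyshev’s inequality leads to

\[ P(|X| \ge a) \le \left( \frac{\sigma}{a} \right)^2. \]

Chebyshev’s inequality provides a stronger bound when $a$ is large (namely when $a > \sigma \sqrt{\pi/2}$).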
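
As a quick numerical illustration of Example A.3, the sketch below evaluates both bounds for a Gaussian with standard deviation $\sigma$ and locates the point where they coincide. The function names and the default $\sigma = 1$ are assumptions made for the illustration.

```python
from math import pi, sqrt

def markov_bound(a, sigma=1.0):
    """Markov bound on P(|X| >= a): E|X| / a with E|X| = sigma * sqrt(2/pi)."""
    return sigma * sqrt(2 / pi) / a

def chebyshev_bound(a, sigma=1.0):
    """Chebyshev bound on P(|X| >= a): Var X / a^2."""
    return (sigma / a) ** 2

# Both bounds coincide at a = sigma * sqrt(pi / 2); here sigma = 1.
crossover = sqrt(pi / 2)
for a in (1.0, crossover, 3.0):
    print(f"a = {a:.4f}: Markov = {markov_bound(a):.4f}, Chebyshev = {chebyshev_bound(a):.4f}")
```

Below the crossover the first moment bound is the smaller one; above it the second moment bound wins, and its advantage grows like $1/a$.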


Application A.4 (Weak law of large numbers). Let $X_1, \dots, X_n$ be independent r.v.
with mean $\mu$ and variance $\sigma^2 < +\infty$. Then, for every $\epsilon > 0$:

\[ P\left( \left| \frac{X_1 + \dots + X_n}{n} - \mu \right| > \epsilon \right) \to 0. \]

With some extra work, we can show that the condition $\sigma^2 < +\infty$ is not needed.
Moreover, the strong law of large numbers states that the convergence holds in fact
almost surely (and not simply in probability, as we have here).


Proof. Applying Chebyshev’s inequality to $U_n = \frac{X_1 + \dots + X_n}{n}$, which has
mean $\mu$ and variance $\frac{\sigma^2}{n}$, leads to:

\[ P(|U_n - \mu| > \epsilon) \le \frac{\sigma^2}{n \epsilon^2} \to 0. \]


Corollary A.5 (Second moment method). Let $X$ be a positive random variable. We
have:

\[ P(X = 0) \le \frac{\mathrm{Var}\, X}{(EX)^2} = \frac{E(X^2)}{(EX)^2} - 1. \]

Proof. We apply Chebyshev’s inequality with $a = EX$:

\[ P(X = 0) \le P(|X - EX| \ge EX) \le \frac{\mathrm{Var}\, X}{(EX)^2}, \]

where the first inequality holds since $|X - EX| \ge EX \iff X \le 0$ or $X \ge 2EX$.


Remark A.3. From Cauchy-Schwarz inequality,

\[ E(X) \le E\left(X 1_{X > 0}\right) \le \sqrt{E(X^2)} \sqrt{P(X > 0)}, \]

and thus $P(X = 0) = 1 - P(X > 0) \le \frac{\mathrm{Var}(X)}{E(X^2)}$, which provides a slightly
stronger inequality than Corollary A.5.


Concentration of sums of i.i.d. random variables 


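Proposition A.8 (Hoeffding’s inequality). Let $X_i$ be some independent random variables,
such that $a_i \le X_i \le b_i$, and $S_n = \sum_{i=1}^{n} X_i$. For $t > 0$, we have:

\[ P(S_n \ge ES_n + t) \le \exp\left( - \frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right), \]
\[ P(S_n \le ES_n - t) \le \exp\left( - \frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right), \]
\[ P(|S_n - ES_n| \ge t) \le 2 \exp\left( - \frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right). \]

More details about concentration inequalities can be found for example in (Vershynin, 2018, Chapter 2).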
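
The following sketch compares the Hoeffding bound with an exact tail in the simplest case, Bernoulli variables with $a_i = 0$ and $b_i = 1$, so that $S_n \sim \mathrm{Bin}(n, p)$. It assumes the standard form of the bound, $\exp(-2t^2 / \sum_i (b_i - a_i)^2)$; the function names and the parameters $n = 50$, $p = 1/2$ are illustrative assumptions.

```python
from math import comb, exp

def exact_upper_tail(n, p, t):
    """Exact P(S_n - E S_n >= t) for S_n ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k - n * p >= t)

def hoeffding_bound(n, t):
    """exp(-2 t^2 / sum_i (b_i - a_i)^2) with a_i = 0 and b_i = 1 for all i."""
    return exp(-2 * t**2 / n)

n, p = 50, 0.5   # assumed example parameters
for t in (5, 10, 15):
    print(t, exact_upper_tail(n, p, t), hoeffding_bound(n, t))
```

Unlike the Markov and Chebyshev bounds, the guarantee here decays exponentially in $t^2 / n$, which is what makes it useful for sums of many bounded terms.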


A.2 Graph Theory 


A.2.1 Definitions, Vocabulary 


Definition A.5. A graph $G$ is a pair $(V, E)$, where $V$ is a finite set, whose elements
are called nodes (or vertices, points) and $E$ is a set of ordered node pairs called edges
(or links, lines, bonds). Moreover, we use the following vocabulary:


• if $(i,j) \in E \iff (j,i) \in E$, then the graph is said to be undirected (this means
that if there is a link going from $i$ to $j$, there exists the same link in the opposite
direction);

• the edges $(i,i)$ are called self-loops. In particular, if for all nodes $i$, $(i,i) \notin E$, we
say that there are no self-loops;

• the graph is weighted if every edge $(i,j) \in E$ has a weight $w_{ij} > 0$;

• we call in-degree of node $i$, denoted $d_i^{\mathrm{in}}$, the number of (possibly weighted)
edges coming to $i$, that is $d_i^{\mathrm{in}} = \sum_{j \in V} w_{ji}$. Similarly, the out-degree of node $i$ is
the number of edges going from $i$, that is $d_i^{\mathrm{out}} = \sum_{j \in V} w_{ij}$. For an undirected
graph, $d_i^{\mathrm{in}} = d_i^{\mathrm{out}} = d_i$ and we simply call $d_i$ the degree of node $i$.
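
As an illustration of Definition A.5, the sketch below computes weighted in-degrees and out-degrees from a dictionary of directed edges. The representation (nodes numbered from 0, edges stored as a dict mapping `(i, j)` to $w_{ij}$) and the function name `degrees` are assumptions chosen for the example.

```python
def degrees(n, weights):
    """Return (d_in, d_out) for a directed weighted graph on nodes 0..n-1.

    `weights` maps an ordered pair (i, j) to the weight w_ij of edge i -> j.
    """
    d_in = [0.0] * n
    d_out = [0.0] * n
    for (i, j), w in weights.items():
        d_out[i] += w   # edge leaves i
        d_in[j] += w    # edge enters j
    return d_in, d_out

weights = {(0, 1): 2.0, (1, 2): 1.0, (2, 0): 0.5, (0, 2): 1.5}
d_in, d_out = degrees(3, weights)
print(d_in, d_out)
```

For an undirected graph one would store both orientations of every edge with equal weights, in which case the two lists coincide.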


Definition A.6.


• We call a path of $G$ of length $k$ a sequence $e_1, \dots, e_k$ of edges $e_i = (v_{i-1}, v_i)$
where the $v_i$ are vertices;

• A $k$-cycle is a path of length $k$ that starts and ends at the same vertex;

• Suppose $G$ is undirected. We say that two nodes $u, v$ are connected if there
exists a path going from $u$ to $v$. We denote this as $u \leftrightarrow v$.


Proposition A.9. The relation $\leftrightarrow$ is an equivalence relation for undirected
graphs. In particular, we can partition the nodes into equivalence classes, called the
connected components.


Proof. We have $u \leftrightarrow u$ (path of length 0). Moreover, if $u \leftrightarrow v$ and $v \leftrightarrow z$, then $u \leftrightarrow z$
(by combining the two paths); this ensures transitivity. Finally, $u \leftrightarrow v$ implies
$v \leftrightarrow u$ (the same path, in the opposite direction): this ensures symmetry.


Remark A.4. In particular, this means that there exists a path between any two nodes in
the same connected component. Conversely, no path connects two nodes belonging
to two different connected components.


Definition A.7. We say that $G$ is connected if $G$ has only one equivalence class under
the relation $\leftrightarrow$. We say $G$ is disconnected otherwise.


In particular, in a connected graph, for every pair of nodes $i$ and $j$, there exists a path
going from $i$ to $j$.


Definition A.8. Let $i, j$ be two nodes. We call the distance between $i$ and $j$, and
denote $d(i, j)$, the length of the shortest path between $i$ and $j$. If $i \not\leftrightarrow j$, then $d(i,j) = +\infty$.


Definition A.9 (Diameter). We call diameter of a graph the largest distance 
between any pair of connected vertices. 
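
The notions of connected component, distance and diameter can be computed with a breadth-first search. The sketch below is a minimal illustration for an unweighted undirected graph stored as an adjacency-list dictionary; that representation and the function names are assumptions made here.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_components(adj):
    """Equivalence classes of the relation 'there is a path between u and v'."""
    seen, components = set(), []
    for u in adj:
        if u not in seen:
            component = set(bfs_distances(adj, u))
            seen |= component
            components.append(component)
    return components

def diameter(adj):
    """Largest distance between any pair of connected vertices."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)

# A path 0 - 1 - 2 - 3 and a separate edge 4 - 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(adj), diameter(adj))
```

Nodes missing from the dictionary returned by `bfs_distances` are exactly those at distance $+\infty$ from the source.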


A.2.2  Adjacency Matrix 


Definition A.10. Let $G = (V, E)$ be an unweighted graph with $n$ nodes. The
adjacency matrix of $G$ (denoted by $A$) is the binary matrix $A \in \{0, 1\}^{n \times n}$ such that
$A_{ij} = 1$ if $(i,j) \in E$, and $A_{ij} = 0$ otherwise.
We can easily extend this definition to a weighted graph: the element $A_{ij}$ is then
equal to the weight $w_{ij}$ of the edge between nodes $i$ and $j$.


Remark A.5. A is symmetric if and only if the graph is undirected. Moreover, the 
diagonal elements of A are zeros if and only if the graph does not have any self-loops. 


Definition A.11. We call the degree matrix, denoted $D$, of a graph $G$ the diagonal
matrix whose diagonal element $D_{ii}$ is the degree of node $i$.


A.2.3 Graph Laplacians 


In the following, we will consider G as an undirected, weighted graph, on vertex 
set V = {1,..., n}. We denote by A the adjacency matrix of G, and by D its degree 
matrix. 


Definition A.12 (Graph Laplacian). We define:


• the (standard or combinatorial) Laplacian $L = D - A$;
• the normalized Laplacian $\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$;
• the PageRank Laplacian $L_{PR} = I - D^{-1} A$.


Remark A.6. Note that $D^{-1/2}$ and $D^{-1}$ are not well defined if there is an isolated
node (a node of degree 0). We can either assume there is no such node in our graph,
or by convention we let $D_{ii}^{-1/2} = D_{ii}^{-1} = 0$ if $i$ is isolated.


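Lemma A.6. If $i$ and $j$ are two neighboring nodes, we express this as $i \sim j$. Furthermore,
assume the graph does not have self-loops. Then, we have:

\[
L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -1 & \text{if } i \sim j, \\ 0 & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
\mathcal{L}_{ij} = \begin{cases} 1 & \text{if } i = j, \\ -\frac{1}{\sqrt{d_i d_j}} & \text{if } i \sim j, \\ 0 & \text{otherwise.} \end{cases}
\]

Proof. This is direct from the definitions of $L$ and $\mathcal{L}$.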
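
The entries given in Lemma A.6 can be checked on a small graph. The sketch below builds $L = D - A$ and $D^{-1/2} L D^{-1/2}$ from an adjacency matrix stored as a list of lists, using the convention of Remark A.6 for isolated nodes; the function name `laplacians` and the example graph (a path on three nodes) are assumptions.

```python
from math import sqrt

def laplacians(A):
    """Return (degrees, L, L_norm) with L = D - A and L_norm = D^{-1/2} L D^{-1/2}."""
    n = len(A)
    d = [sum(row) for row in A]
    L = [[(d[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]
    # Convention: the inverse square root of a zero degree is set to 0.
    inv_sqrt = [1 / sqrt(x) if x > 0 else 0.0 for x in d]
    L_norm = [[inv_sqrt[i] * L[i][j] * inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    return d, L, L_norm

# Path graph 0 - 1 - 2 (unweighted, undirected, no self-loops).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
d, L, L_norm = laplacians(A)
print(d)
print(L)
print(L_norm)
```

On this example the off-diagonal entries of the normalized Laplacian between neighbors equal $-1/\sqrt{d_i d_j} = -1/\sqrt{2}$, as the lemma states.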
Basic properties of the Laplacians


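Proposition A.10. The standard Laplacian $L = D - A$ has the following properties:


1. For any vector $x \in \mathbb{R}^n$ we have $x^T L x = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} (x_i - x_j)^2$. More generally,
for any matrix $X \in \mathbb{R}^{n \times K}$ we have

\[ \mathrm{Tr}\left( X^T L X \right) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j} a_{ij} (X_{ik} - X_{jk})^2; \]

2. $L$ is symmetric and positive semi-definite;
3. $L$ has $n$ non-negative, real valued eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n$. Moreover,
$L 1_n = 0_n$.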


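Proof. 1. Recall that $d_i = \sum_{j=1}^{n} a_{ij}$. We can write

\begin{align*}
\frac{1}{2} \sum_{i,j=1}^{n} a_{ij} (x_i - x_j)^2
&= \frac{1}{2} \left( \sum_{i,j} a_{ij} x_i^2 - 2 \sum_{i,j} a_{ij} x_i x_j + \sum_{i,j} a_{ij} x_j^2 \right) \\
&= \frac{1}{2} \left( \sum_{i=1}^{n} d_i x_i^2 - 2 \sum_{i,j=1}^{n} a_{ij} x_i x_j + \sum_{j=1}^{n} d_j x_j^2 \right) \\
&= \sum_{i=1}^{n} d_i x_i^2 - \sum_{i,j=1}^{n} x_i x_j a_{ij} \\
&= x^T D x - x^T A x \\
&= x^T L x.
\end{align*}

More generally, for $X \in \mathbb{R}^{n \times K}$ we notice that

\[ \mathrm{Tr}\left( X^T L X \right) = \sum_{k=1}^{K} X_{\cdot k}^T L X_{\cdot k}, \]

where $X_{\cdot k}$ denotes the column $k$ of $X$, and hence the result holds by
applying the previous result to each column.
2. $L$ is symmetric because $D$ and $A$ are. From point 1, we have $x^T L x \ge 0$, so
$L$ is positive semi-definite.
3. $L$ is symmetric, so its eigenvalues are real. It is positive semi-definite, so its
eigenvalues are non-negative. Finally, $L 1_n = 0_n$ is straightforward using the
formula derived in point 1.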
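
The identity of point 1 can be verified numerically on a small weighted graph. The sketch below is an illustration only; the helper names, the weight matrix and the test vector are assumptions.

```python
def laplacian(A):
    """Standard Laplacian L = D - A for a symmetric weight matrix A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, x):
    """x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def edge_energy(A, x):
    """(1/2) * sum_{i,j} a_ij (x_i - x_j)^2."""
    n = len(x)
    return 0.5 * sum(A[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))

A = [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 0]]
x = [1.0, -2.0, 0.5]
print(quadratic_form(laplacian(A), x), edge_energy(A, x))
```

Both quantities agree, and the second expression makes it apparent that the quadratic form is non-negative and vanishes on constant vectors.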
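Proposition A.11. The normalized Laplacian $\mathcal{L}$ satisfies the following properties:


1. For any vector $x \in \mathbb{R}^n$ we have $x^T \mathcal{L} x = \frac{1}{2} \sum_{i,j} a_{ij} \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^2$. More
generally, for any matrix $X \in \mathbb{R}^{n \times K}$ we have

\[ \mathrm{Tr}\left( X^T \mathcal{L} X \right) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j} a_{ij} \left( \frac{X_{ik}}{\sqrt{d_i}} - \frac{X_{jk}}{\sqrt{d_j}} \right)^2; \]

2. $\mathcal{L}$ is symmetric and positive semi-definite;
3. $\mathcal{L}$ has $n$ non-negative, real valued eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n \le 2$. Moreover,
$D^{1/2} 1_n$ is an eigenvector of $\mathcal{L}$ associated to the eigenvalue 0.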


The proof of Proposition A.11 is similar to that of Proposition A.10. 


Standard Laplacian and the number of connected components 


Definition A.13 (Indicator vector of a set). Let $U$ be a subset of the node set $V$.
We define $1_U$ as the $n \times 1$ vector such that $(1_U)_i = 1$ if $i \in U$, and $0$ if $i \notin U$.
We let $1_n$ be the $n$-by-1 vector of all ones.


Lemma A.7. $LX = 0 \iff X$ is constant on each connected component of $G$.


Proof. Let $V_1, \dots, V_k$ be the connected components of $G$. Assume $LX = 0$. Then
$X^T L X = 0$, and from the formula of the previous proposition it follows that
$x_i = x_j$ whenever $a_{ij} > 0$. Following a path between any two nodes of the same component, we obtain
$\forall i, j \in V_\ell : x_i = x_j$. We conclude that $LX = 0$ implies that $X$ is constant on
each connected component of $G$.

Conversely, we can see from a direct computation that if $X$ is constant on
each $V_\ell$, then $LX = 0$.


Proposition A.12 (Number of connected components). Let $G$ be an undirected
graph with non-negative weights. Then, the multiplicity $k$ of the eigenvalue 0 of $L$ is
equal to the number of connected components $V_1, \dots, V_k$. Moreover, the eigenspace of
eigenvalue 0 ($\mathrm{Ker}\, L$) is spanned by the indicator vectors $1_{V_1}, \dots, 1_{V_k}$.


Proof. If $k = 1$, it means the only eigenvector of 0 is $X = 1_n$, and the graph is
connected.

Now suppose $k > 1$. We can assume that the vertices are ordered according to
the connected components they belong to. Thus, $L = \mathrm{diag}(L_1, \dots, L_k)$, where $L_i$
is the Laplacian of the $i$-th connected component. Each $L_i$ has eigenvalue 0 with
multiplicity 1, and the corresponding eigenvector is the constant vector of ones.
Thus, $L 1_{V_i} = 0$ (because $L_i$ applied to the all-ones vector is zero), and each $1_{V_i}$ is an eigenvector of $L$ associated to 0.
Example A.4. Suppose that the graph is connected. Then there is only one
connected component ($k = 1$), so $\dim \mathrm{Ker}\, L = 1$, and the corresponding eigenspace
is spanned by $1_n$.
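
To illustrate Proposition A.12 in the disconnected case, the sketch below takes a graph with two connected components and checks that both indicator vectors lie in the kernel of $L$, while a vector that is not constant on a component does not. The example graph and helper names are assumptions.

```python
def laplacian(A):
    """Standard Laplacian L = D - A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Two components: {0, 1, 2} (a path) and {3, 4} (a single edge).
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
L = laplacian(A)
indicator_1 = [1, 1, 1, 0, 0]   # indicator vector of {0, 1, 2}
indicator_2 = [0, 0, 0, 1, 1]   # indicator vector of {3, 4}
print(matvec(L, indicator_1), matvec(L, indicator_2))
print(matvec(L, [1, 0, 0, 0, 0]))   # not constant on the first component
```

The two indicator vectors are linearly independent, so the kernel has dimension at least two, matching the number of components.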


A.3 Linear Algebra 


A.3.1 Symmetric Matrices


Theorem A.8 (Spectral theorem). If $M$ is symmetric and real valued, then there exists
an orthonormal basis consisting of eigenvectors of $M$. Moreover, the eigenvalues of $M$
are real.


Counterexample A.9 (if $M$ has complex entries). $M = \begin{pmatrix} 1 & i \\ i & -1 \end{pmatrix}$ is symmetric
but not diagonalizable. Indeed, from a direct computation of its characteristic
polynomial, we can see that the only eigenvalue is 0, while $M \neq 0$.


Definition A.14. A symmetric matrix $M$ is said to be positive semidefinite (PSD)
(resp., positive definite, PD) if $\forall x \in \mathbb{R}^n : x^T M x \ge 0$ (resp., $x^T M x > 0$ for all $x \neq 0$).


Example A.5. For all $M \in \mathbb{R}^{n \times n}$, the matrix $M^T M$ is symmetric positive semidefinite (and positive definite if $M$ is invertible).


Lemma A.10. Let $M$ be a symmetric matrix, and $\lambda_1, \dots, \lambda_n$ its (real) eigenvalues.
$M$ is positive semidefinite (resp., positive definite) iff $\lambda_i \ge 0$ (resp., $\lambda_i > 0$) for all $i$.


A.3.2 Norms 
Definition A.15. Let $E$ be a vector space. A function $N : E \to \mathbb{R}$ is a norm if it
satisfies the following properties:


1. (positivity) $\forall x \in E : N(x) \ge 0$;

2. (definiteness) $N(x) = 0 \Rightarrow x = 0_E$;

3. (homogeneity) $\forall x \in E, t \in \mathbb{R} : N(tx) = |t| N(x)$;

4. (triangle inequality) $\forall x, y \in E : N(x + y) \le N(x) + N(y)$.

Vector norms


Proposition A.13. Let $E = \mathbb{R}^n$ and $p \ge 1$, we define the $\ell^p$-norms as follows:

\[ \|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \]

Proof. It is straightforward to show that $\|\cdot\|_p$ verifies the first three conditions. The
triangle inequality holds thanks to Minkowski inequality.

Example A.6. Let $x = (x_1, \dots, x_n)^T \in \mathbb{R}^n$. We have the following particular cases
of $\ell^p$-norms:


1. for $p = 1$, $\|x\|_1 = \sum_{i=1}^{n} |x_i|$;

2. for $p = 2$, $\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2}$ is the Euclidean norm;

3. for $p = \infty$, we define $\|x\|_\infty = \lim_{p \to +\infty} \|x\|_p = \max\{|x_1|, \dots, |x_n|\}$.


Moreover, if we introduce the scalar product $\langle x, y \rangle = x^T y$ between two vectors
$x, y \in \mathbb{R}^n$, then $\|x\|_2^2 = x^T x$.


Matrix norms (Serre, 2010) 


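Definition A.16. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$, we define the operator norm $|||\cdot|||$ on
$\mathbb{R}^{n \times n}$ induced by $\|\cdot\|$ as

\[ |||A||| = \sup_{x \in \mathbb{R}^n :\, x \neq 0} \frac{\|Ax\|}{\|x\|}. \]

By abuse of notation, we often denote by $\|\cdot\|$ the operator norm instead of $|||\cdot|||$.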


Lemma A.11. Let $A \in \mathbb{R}^{n \times n}$. Then

\[ |||A||| = \sup_{\|x\| = 1} \|Ax\| = \sup_{\|x\| \le 1} \|Ax\| = \max_{\|x\| \le 1} \|Ax\|. \]


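Example A.7. Let $A \in \mathbb{R}^{n \times n}$. We have the following induced norms:


1. $|||A|||_1 = \sup_{\|x\|_1 = 1} \|Ax\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |A_{ij}|$ (max column-sum);
2. $|||A|||_\infty = \sup_{\|x\|_\infty = 1} \|Ax\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}|$ (max row-sum);
3. $|||A|||_2 = \sup_{x^T x = 1} \sqrt{x^T A^T A x} = \sqrt{\lambda_{\max}(A^T A)}$, where $\lambda_{\max}(A^T A)$ denotes the
largest eigenvalue of the (symmetric) matrix $A^T A$;
4. if $A$ is invertible, then $|||A^{-1}|||_2 = \frac{1}{\sqrt{\lambda_{\min}(A^T A)}}$, where $\lambda_{\min}(A^T A)$ is the
smallest eigenvalue of $A^T A$ (non-zero if $A$ is invertible).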


Proposition A.14. Let $|||\cdot|||$ be an induced operator norm. Then $\forall A, B \in \mathbb{R}^{n \times n}$ :
$|||AB||| \le |||A||| \times |||B|||$.

Counterexample A.12. This inequality is false in general if the norm is not
induced from a vector norm. For example, let $N(A) = \max_{i,j} |a_{ij}|$ (not to be
confused with $|||\cdot|||_\infty$), and $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Then, $N(A^2) = 2 > N(A) N(A) = 1$.
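
The counterexample can be checked in a few lines. The sketch below assumes that $A$ is the $2 \times 2$ all-ones matrix (which gives $N(A^2) = 2$ and $N(A) = 1$ as stated) and contrasts the entrywise maximum with the induced $\infty$-norm (max row-sum), which is sub-multiplicative. The helper names are assumptions.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_entry(A):
    """N(A) = max_{i,j} |a_ij| (a norm, but not an induced operator norm)."""
    return max(abs(a) for row in A for a in row)

def max_row_sum(A):
    """Induced infinity-norm: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

A = [[1, 1],
     [1, 1]]
A2 = matmul(A, A)
print(max_entry(A2), max_entry(A) ** 2)       # sub-multiplicativity fails
print(max_row_sum(A2), max_row_sum(A) ** 2)   # sub-multiplicativity holds
```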


Definition A.17. We denote $\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} |A_{ij}|^2}$ the Frobenius (or
Hilbert-Schmidt) norm of a matrix $A \in \mathbb{R}^{n \times n}$.


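A.3.3  Courant-Fischer Theorem


Theorem A.13 (Courant-Fischer theorem). Let $M \in \mathbb{R}^{n \times n}$ be an $n$-by-$n$ symmetric
matrix, and $\lambda_1 \le \dots \le \lambda_n$ the eigenvalues of $M$, with associated normalized eigenvectors
$v_1, \dots, v_n$. We have

\[ \lambda_1 = \min_{x \in \mathbb{R}^n :\, \|x\| = 1} x^T M x = \min_{x \in \mathbb{R}^n :\, x \neq 0_n} \frac{x^T M x}{x^T x}, \tag{A.1} \]

\[ \lambda_2 = \min_{\substack{x \in \mathbb{R}^n,\ \|x\| = 1 \\ x \perp v_1}} x^T M x = \min_{\substack{x \in \mathbb{R}^n,\ x \neq 0_n \\ x \perp v_1}} \frac{x^T M x}{x^T x}, \tag{A.2} \]

\[ \lambda_n = \max_{x \in \mathbb{R}^n :\, \|x\| = 1} x^T M x = \max_{x \in \mathbb{R}^n :\, x \neq 0_n} \frac{x^T M x}{x^T x}. \tag{A.3} \]

Moreover, the optima are attained at $v_1$, $v_2$ and $v_n$, respectively.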


Proof. Let us give two proofs, one by diagonalizing the matrix $M$, and one using
calculus (Lagrange multipliers).

(i) First proof. $M$ being symmetric, we can write $M = P^T D P$, with $P$ orthogonal and $D = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. Let $y = Px$.
Note that $\|y\| = \|x\|$, thus the constraint $\|x\| = 1$ becomes $y^T y = 1$. Since
$x^T M x = y^T D y = \sum_i \lambda_i y_i^2$, this expression is minimised (given the constraint)
when all $y_i$ are zeros except for $y_1 = 1$, and maximised when $y_n = 1$ and all others
$y_i$ are 0. If $x \perp v_1$, then $y_1 = 0$ is further imposed, and $y^T D y$ is minimised if $y_2 = 1$
and other $y_i$ are null.

(ii) Second proof. The Lagrangian associated to the minimisation problem (A.1)
(or (A.3)) is $L(x, \lambda) = x^T M x - \lambda (x^T x - 1)$. Note that letting $\frac{\partial L}{\partial \lambda} = 0$ gives
back the constraint $\|x\| = 1$. Moreover, $\frac{\partial L}{\partial x} = 2Mx - 2\lambda x$, and hence $\frac{\partial L}{\partial x} = 0$
leads to $Mx = \lambda x$. Thus, $x$ is an eigenvector of $M$ and $\lambda$ is the corresponding
eigenvalue. As Equation (A.1) is a minimisation problem, its solution is the smallest
eigenvalue. Similarly, the solution of equation (A.3) is the largest eigenvalue. Finally,
the solution is the second smallest eigenvalue if we further impose that $x \perp v_1$.
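
The variational characterisation (A.1) and (A.3) can be illustrated on a $2 \times 2$ example by scanning the unit circle. The matrix below, with eigenvalues 1 and 3, the grid resolution and the helper name `rayleigh` are assumptions made for this sketch.

```python
from math import cos, sin, pi

# Symmetric matrix with eigenvalues 1 (eigenvector (1, -1)) and 3 (eigenvector (1, 1)).
M = [[2.0, 1.0],
     [1.0, 2.0]]

def rayleigh(M, x):
    """Rayleigh quotient x^T M x / x^T x."""
    n = len(x)
    Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(xi * mi for xi, mi in zip(x, Mx)) / sum(xi * xi for xi in x)

# Unit vectors (cos(theta), sin(theta)) for theta on a grid of [0, pi).
values = [rayleigh(M, (cos(k * pi / 1000), sin(k * pi / 1000)))
          for k in range(1000)]
lam_min, lam_max = min(values), max(values)
print(lam_min, lam_max)
```

The scan recovers the two eigenvalues as the extreme values of the quotient, attained in the directions of the corresponding eigenvectors.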


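Proposition A.15. Let $M \in \mathbb{R}^{n \times n}$ be a symmetric matrix with $v_1, \dots, v_n$ being an
orthonormal basis of eigenvectors associated with $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. The solution
of the optimisation problem

\[ \mathop{\arg\min}_{\substack{H \in \mathbb{R}^{n \times K} \\ H^T H = I_K}} \mathrm{Tr}\left( H^T M H \right) \]

is given by $H = [v_1, \dots, v_K]$.

Proof. Consider the Lagrangian $L(H, \Lambda) = \mathrm{Tr}\left( H^T M H \right) - \mathrm{Tr}\left( \Lambda^T (H^T H - I_K) \right)$,
where $\Lambda \in \mathbb{R}^{K \times K}$ is a diagonal matrix, whose entries are the Lagrange multipliers.
Since $\frac{\partial L}{\partial H} = 2MH - 2H\Lambda$, the condition $\frac{\partial L}{\partial H} = 0$ leads to $MH = H\Lambda$.
Thus, the columns of $H$ are indeed eigenvectors of $M$, and the diagonal elements
of $\Lambda$ are the corresponding eigenvalues.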


A.4 Calculus on Graphs 


We refer the reader to (Hein et al., 2007) for additional details on the topic of 
this section. 


A.4.1 Basic Reminders 

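Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The gradient of $f$ at a point $x \in \mathbb{R}^n$ is the
vector $\nabla f(x) = \mathrm{grad} f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)^T$. The divergence of a vector field $F = (F_1, \dots, F_n) : \mathbb{R}^n \to \mathbb{R}^n$ is defined for
every $x \in \mathbb{R}^n$ as $\mathrm{div} F(x) = \sum_{i=1}^{n} \frac{\partial F_i}{\partial x_i}(x)$, and the Laplacian operator is $\Delta f(x) = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}(x)$. In particular, we have $\mathrm{div}(\mathrm{grad} f)(x) = \Delta f(x)$.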


A.4.2 Extension on Graphs 


In this section, we consider a directed and weighted graph $G = (V, E)$ whose
weights are $w_{ij}$ and node set is $V = \{1, \dots, n\}$.


Functions on graph 


We denote by $\mathcal{F}(V)$ the set of node functions $f : V \to \mathbb{R}$. Since $|V| = n$, any node
function $f$ can be represented as an $n$-by-1 vector $(f(1), \dots, f(n))^T$ and $\mathcal{F}(V) \cong \mathbb{R}^n$.
In particular, $\mathcal{F}(V)$ is an $n$-dimensional Hilbert space whose inner product is

\[ \langle f, g \rangle_{\mathcal{F}(V)} = \sum_{i \in V} f(i) g(i) \]

and associated norm $\|f\|_{\mathcal{F}(V)} = \sqrt{\langle f, f \rangle_{\mathcal{F}(V)}}$.
Similarly, the space of edge functions is $\mathcal{F}(E) = \{F : E \to \mathbb{R}\}$. This space is
equivalent to $\mathbb{R}^{|E|}$, and we introduce the inner product

\[ \langle F, G \rangle_{\mathcal{F}(E)} = \sum_{(i,j) \in E} F(i,j) G(i,j), \]

and the norm $\|F\|_{\mathcal{F}(E)} = \sqrt{\langle F, F \rangle_{\mathcal{F}(E)}}$. Finally, we can trivially extend any
edge function $F : E \to \mathbb{R}$ to a function $F : V \times V \to \mathbb{R}$ by letting $F(v_i, v_j) = 0$
if $(v_i, v_j) \notin E$. By a slight abuse of notation, we still denote by $F$ the extended
function.


Differential graphs operators 


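Let $\gamma : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\gamma(0) = 0$. The choice of $\gamma$ will be discussed later.
The graph derivative of $f$ along a directed edge $(i,j) \in E$ is

\[ \left. \frac{\partial f}{\partial (i,j)} \right|_{i} = \gamma(w_{ij}) \left( f(j) - f(i) \right), \]

and denoted $\partial_j f(i)$ for convenience. In particular, $\partial_i f(i) = 0$ and $f(i) = f(j)$
implies $\partial_j f(i) = 0$.

The graph gradient of a node function $f \in \mathcal{F}(V)$ is denoted $\mathrm{grad} f$, and is
defined by

\[ \forall (i,j) \in E : \quad (\mathrm{grad} f)(i,j) = \partial_j f(i). \]

Hence $\mathrm{grad} : \mathcal{F}(V) \to \mathcal{F}(E)$ is a linear operator. The graph divergence $\mathrm{div}$ is
defined to be the adjoint operator¹ of $\mathrm{grad}$, that is

\[ \langle \mathrm{grad} f, G \rangle_{\mathcal{F}(E)} = \langle f, \mathrm{div}\, G \rangle_{\mathcal{F}(V)} \qquad \forall f \in \mathcal{F}(V),\ \forall G \in \mathcal{F}(E). \]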


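Lemma A.14. The divergence $\mathrm{div} : \mathcal{F}(E) \to \mathcal{F}(V)$, adjoint of the gradient operator, is
given by:

\[ (\mathrm{div}\, G)(i) = \sum_{j} \left( \gamma(w_{ji}) G(j,i) - \gamma(w_{ij}) G(i,j) \right). \]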


1. The adjoint is well defined here since the considered Hilbert spaces have finite dimensions. 
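Proof. We have

\begin{align*}
\langle \mathrm{grad} f, G \rangle_{\mathcal{F}(E)}
&= \sum_{i,j} \gamma(w_{ij}) \left( f(j) - f(i) \right) G(i,j) \\
&= \sum_{i,j} \gamma(w_{ij}) f(j) G(i,j) - \sum_{i,j} \gamma(w_{ij}) f(i) G(i,j) \\
&= \sum_{i,j} \gamma(w_{ji}) f(i) G(j,i) - \sum_{i,j} \gamma(w_{ij}) f(i) G(i,j) \\
&= \sum_{i} f(i) \sum_{j} \left( \gamma(w_{ji}) G(j,i) - \gamma(w_{ij}) G(i,j) \right) \\
&= \langle f, \mathrm{div}\, G \rangle_{\mathcal{F}(V)}.
\end{align*}

For undirected graphs ($w_{ij} = w_{ji}$), the divergence reduces to

\[ (\mathrm{div}\, G)(i) = \sum_{j} \gamma(w_{ij}) \left( G(j,i) - G(i,j) \right). \]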


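Finally, we define the graph Laplacian $\Delta_\gamma : \mathcal{F}(V) \to \mathcal{F}(V)$ such that $\Delta_\gamma f = \mathrm{div}(\mathrm{grad} f)$
for every $f \in \mathcal{F}(V)$.


Lemma A.15. Let $f \in \mathcal{F}(V)$ and $i \in V$. We have:

\[ (\Delta_\gamma f)(i) = \sum_{j} \left( \gamma(w_{ij})^2 + \gamma(w_{ji})^2 \right) \left( f(i) - f(j) \right). \]

Proof. Let $\gamma_{ij} = \gamma(w_{ij})$. We have

\begin{align*}
\mathrm{div}(\mathrm{grad} f)(i)
&= \sum_{j} \gamma_{ji}\, \mathrm{grad} f(j, i) - \gamma_{ij}\, \mathrm{grad} f(i, j) \\
&= \sum_{j} \gamma_{ji}^2 \left( f(i) - f(j) \right) - \gamma_{ij}^2 \left( f(j) - f(i) \right) \\
&= \sum_{j} \left( \gamma_{ij}^2 + \gamma_{ji}^2 \right) \left( f(i) - f(j) \right).
\end{align*}

For an undirected graph, the Laplacian operator reduces to

\[ (\Delta_\gamma f)(i) = 2 \sum_{j} \gamma(w_{ij})^2 \left( f(i) - f(j) \right). \]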
J 
Recall the standard Laplacian $L = D - A$, and let $f : V \to \mathbb{R}$ be a node function
represented by an $n$-by-1 vector. We have

\[ (Lf)_i = d_i f_i - \sum_{j} w_{ij} f_j = \sum_{j} w_{ij} (f_i - f_j). \]

Thus, $L = \Delta_\gamma$ if $\gamma(x) = \sqrt{x/2}$. Similarly, the random walk Laplacian $L_{rw} = I - D^{-1} A$ verifies

\[ (L_{rw} f)_i = f_i - \sum_{j} \frac{w_{ij}}{d_i} f_j = \sum_{j} \frac{w_{ij}}{d_i} (f_i - f_j), \]

where the second equality holds since $\sum_j \frac{w_{ij}}{d_i} = 1$. Hence, $L_{rw} = \Delta_\gamma$ if $\gamma$ verifies
$\gamma(w_{ij}) = \sqrt{\frac{w_{ij}}{2 d_i}}$ for all $i, j$.
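
The correspondence between the standard Laplacian and $\Delta_\gamma$ can be checked on a small undirected graph. The sketch below takes the divergence formula of Lemma A.14 literally, in which case the undirected operator carries a factor 2 and matching $L$ requires $\gamma(x)^2 = x/2$; this normalisation, the edge-dictionary representation (both orientations stored) and the function names are assumptions of the illustration.

```python
from math import sqrt

def graph_laplacian(f, w, gamma):
    """(div grad f)(i), using the divergence formula of Lemma A.14."""
    g = {(i, j): gamma(wij) * (f[j] - f[i]) for (i, j), wij in w.items()}
    out = [0.0] * len(f)
    for (i, j), wij in w.items():
        out[j] += gamma(wij) * g[(i, j)]   # incoming edge at j
        out[i] -= gamma(wij) * g[(i, j)]   # outgoing edge at i
    return out

def standard_laplacian_apply(f, w):
    """(L f)_i = sum_j w_ij (f_i - f_j), where L = D - A."""
    out = [0.0] * len(f)
    for (i, j), wij in w.items():
        out[i] += wij * (f[i] - f[j])
    return out

# Undirected graph: both orientations are present with equal weights.
w = {(0, 1): 2.0, (1, 0): 2.0, (1, 2): 0.5, (2, 1): 0.5}
f = [1.0, -1.0, 3.0]
delta_f = graph_laplacian(f, w, lambda x: sqrt(x / 2))
L_f = standard_laplacian_apply(f, w)
print(delta_f, L_f)
```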
Appendix B


Additional Lemmas Related to the 
Proof of Theorem 5.5 


B.1 Mean-field Solution of the Secular Equation (5.19) 


B.1.1 Spectral Study of a Perturbed Rank-2 Matrix 


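Lemma B.1 (Matrix determinant lemma). Suppose $A \in \mathbb{R}^{n \times n}$ is invertible, and let
$U, V$ be two $n$ by $m$ matrices. Then,

\[ \det(A + UV^T) = \det A \, \det\left( I_m + V^T A^{-1} U \right). \]

Proof. We take the determinant of

\[ \begin{pmatrix} A & -U \\ V^T & I \end{pmatrix} = \begin{pmatrix} A & 0 \\ V^T & I \end{pmatrix} \begin{pmatrix} I & -A^{-1} U \\ 0 & I + V^T A^{-1} U \end{pmatrix}, \]

and we note that $\det \begin{pmatrix} A & -U \\ V^T & I \end{pmatrix} = \det I \, \det(A + UV^T)$ by the Schur complement
formula (Horn and Johnson, 2012, Section 0.8.5).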
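
A small numerical check of Lemma B.1 in the rank-one case ($m = 1$, so that $U$ and $V$ are column vectors $u$ and $v$) is sketched below. The $2 \times 2$ matrix, the vectors and the helper names are assumptions chosen so that the arithmetic can be followed by hand.

```python
def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

A = [[2.0, 1.0],
     [1.0, 3.0]]
u = [1.0, 2.0]
v = [1.0, 1.0]

# Left-hand side: det(A + u v^T).
A_plus = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
lhs = det2(A_plus)

# Right-hand side: det(A) * (1 + v^T A^{-1} u), since I_m is the scalar 1 here.
A_inv = inv2(A)
A_inv_u = [sum(A_inv[i][j] * u[j] for j in range(2)) for i in range(2)]
rhs = det2(A) * (1 + sum(v[i] * A_inv_u[i] for i in range(2)))
print(lhs, rhs)
```

The lemma trades an $n \times n$ determinant for an $m \times m$ one, which is what makes it convenient for low-rank perturbations such as the rank-2 matrix studied next.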
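Proposition B.1. Let $M = ZBZ^T$, where $B = \begin{pmatrix} a & b \\ b & a \end{pmatrix}$ is a $2 \times 2$ matrix, and
$Z = \begin{pmatrix} 1_{n/2} & 0_{n/2} \\ 0_{n/2} & 1_{n/2} \end{pmatrix}$ is an $n \times 2$ matrix. Let $m$ be an even number. We denote by $P_{\mathcal{L}}$
the $n \times n$ diagonal matrix, whose first $\frac{m}{2}$ and last $\frac{m}{2}$ diagonal elements are ones, all
other elements being zeros. Then,

\[ \det\left( tI_n + \lambda P_{\mathcal{L}} - M \right) = t^{n-m-2} (t + \lambda)^{m-2} (t - t_1^+)(t - t_1^-)(t - t_2^+)(t - t_2^-), \]

with

\[ t_1^{\pm} = \frac{1}{2} \left( \frac{n}{2}(a+b) - \lambda \pm \sqrt{ \left( \lambda + \frac{n}{2}(a+b) \right)^2 - 2\lambda (a+b) m } \right), \]

\[ t_2^{\pm} = \frac{1}{2} \left( \frac{n}{2}(a-b) - \lambda \pm \sqrt{ \left( \lambda + \frac{n}{2}(a-b) \right)^2 - 2\lambda (a-b) m } \right). \]

Proof. For now, assume that $t \neq -\lambda$ and $t \neq 0$. Then, $tI_n + \lambda P_{\mathcal{L}}$ is invertible, and
by Lemma B.1,

\begin{align*}
\det\left( tI_n + \lambda P_{\mathcal{L}} - M \right)
&= \det\left( tI_n + \lambda P_{\mathcal{L}} \right) \det\left( I_2 + Z^T (tI_n + \lambda P_{\mathcal{L}})^{-1} (-ZB) \right) \\
&= t^{n-m} (t + \lambda)^{m} \det\left( I_2 - Z^T (tI_n + \lambda P_{\mathcal{L}})^{-1} Z B \right). \tag{B.1}
\end{align*}

Moreover,

\[ \left( tI_n + \lambda P_{\mathcal{L}} \right)^{-1} = \frac{1}{t} (I_n - P_{\mathcal{L}}) + \frac{1}{t + \lambda} P_{\mathcal{L}} = \frac{1}{t} I_n - \frac{\lambda}{t(t + \lambda)} P_{\mathcal{L}}. \]

Therefore, we can write

\begin{align*}
Z^T \left( tI_n + \lambda P_{\mathcal{L}} \right)^{-1} Z B
&= \frac{1}{t} Z^T Z B - \frac{\lambda}{t(t + \lambda)} Z^T P_{\mathcal{L}} Z B \\
&= \frac{1}{t} \frac{n}{2} B - \frac{\lambda}{t(t + \lambda)} \frac{m}{2} B = xB,
\end{align*}

where $x := \frac{n}{2t(t + \lambda)} \left( t + \lambda \left( 1 - \frac{m}{n} \right) \right)$. Thus, a direct computation of the determinant gives

\[ \det\left( I_2 - Z^T (tI_n + \lambda P_{\mathcal{L}})^{-1} Z B \right) = \left( 1 - x(a + b) \right) \left( 1 - x(a - b) \right). \]

Going back to equation (B.1), we can write

\[ \det\left( tI_n + \lambda P_{\mathcal{L}} - M \right) = t^{n-m-2} (t + \lambda)^{m-2} P_1(t) P_2(t), \tag{B.2} \]

with $P_1(t) = t(t + \lambda) - \frac{n}{2}(a + b)\left( t + \lambda\left( 1 - \frac{m}{n} \right) \right)$ and $P_2(t) = t(t + \lambda) - \frac{n}{2}(a - b)\left( t + \lambda\left( 1 - \frac{m}{n} \right) \right)$.
Since $t \in \mathbb{R} \mapsto \det(tI_n + \lambda P_{\mathcal{L}} - M)$ is continuous (even
analytic), expression (B.2) is also valid for $t = 0$ and $t = -\lambda$ (Avrachenkov et al.,
2013a). We end the proof by observing that

\[ P_1(t) = (t - t_1^+)(t - t_1^-) \quad \text{and} \quad P_2(t) = (t - t_2^+)(t - t_2^-), \]

where $t_1^{\pm}$ and $t_2^{\pm}$ are defined in the proposition’s statement.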


Corollary B.2. Let A be the adjacency matrix of a DC-SBM with pin > Pou > O, 
and s be the oracle information. Let 4,t > 0, and d, = 2 (Pin + Pout) — nt, à = 
5 (Pin — Pout). Let A, :=A-Tly, 17 and Pr be the diagonal matrix whose element 
(Pc) ii is 1 ifs; A 0, and 0 otherwise. Then, the spectrum o EL = —EA, +4P — 

is {— yty eee el y + h; o}, where 


l fa = = 
= (4-24 V G44)? = 4.2.00 +m) ), 


H = AG ~24(2+a)"— 4020 +00), 


in — _ len n : 
Proof, Let M = ( pa™p Pa ‘) and Z = ( /2 0 2), Then, we notice 
Pow TT Pin —T 07/2 14/2 


that EA, = ZMZŤ and we can apply Proposition B.1 to compute the characteristic 
polynomial of EL. For x € R, det ( EĈ — xn) = det (= —x)I,-EA, +4P), 


whose roots are —y — t, —y — t,» —y, and —y +A. 


B.1.2 Estimation of $\gamma_*$


Lemma B.3. Let Yx be the solution of equation (5.19) for the mean-field model. Then, 
=a aa 210) < Ve < —a. 


Proof. For 2 > 0, we denote by (xj, Y+(4)) the solution of the system (5.17) on a 
mean-field DC-SBM. The proof is in two steps. First, let us show that y,.(0) = —a 
and y.(co) = —a(1 — 20). For 2 = 0, the constrained linear system (5.17) 
reduces to an eigenvector problem, and hence y,,(0) equals —a, the smallest eigen- 


value of —EA,. Moreover, when 2 = oo, the hard constraint xe = S¢ is enforced, 
and the system (5.17) becomes


ia EA: — Ya (On )uuXu = (EAr ueste 


xT x, = n(1 — no — M) 


and we verify by hand that y, (00) = —a (1 — 240) together with x, = Z, is indeed 
the solution. 

Second, if we let C} (x) = —x’ EA,;x + AG — Px)T G — Px) be the cost func- 
tion minimized in (5.14), then from equation (5.17) we have y.(A1) — 7%(A2) = 
Ca, 1) — Ca, 2) + 21x75 — 2x15. Since 2 > C(x) is increasing, then 
Ay < Az implies C}, (x1) < Cy, (x2). Since xls > 0 (if it was not the case, then 


Ci (—x1) < Cy(%z), and hence x, + arg min eg» Ca (x)), we can conclude that 
Y(0) < yx(A) and that Y (4) < 7a (00). 


B.1.3 Concentration of $\gamma_*$


Proposition B.2. Let y, and y, be the solutions of equation (5.17) for a DC-SBM 
and the mean-field DC-SBM, respectively, Then 


x 27 (@ + Ay ) - 
V2 F nol — no)a2d 


Proof. The gradient with respect to (01, ..-, Ô» b1, «+» Ons y ) of the left-hand-side of 
equation (5.19) is equal to 


2. bi Ab; b;Að; biAy 
D 5; 7 RV; 
j=] Îi y i Yx ( 7 Va) ( i yx) 


Thus, we have 


Vx = Yal < ( 


>, (0; — 7x)? > (0; — 7)? > (0; — 7)? 
Firstly, we see that for all z € [nz], Ad; = |ô; — 6; < ||A-EA|l < d by the 
concentration of the adjacency matrix of a DC-SBM graph. Therefore, using this 
fact and 7, < 0, < ô < --- < Ôm 


: 1 = z 
maxi Gay? 2; lea - lb: — bil 


SP 1 72 
Mugo Ah 


Ay = |ys— Jal < max |ð; — 6;| + 


= max; (d;— 7) lei - lb; — bil 
dp” (° ] y pa feel 
min; (6; — 7s) >i b; 


lA 
We notice that min; |ô; — 7.| = 0) — 7+. By using Lemma B.3 and the expression
of ô given in Corollary B.2, we have 


min |ô; -7| > ata. 
Similarly, max; |ô; — 7+| = On — Yu = Ôn — 01 +01 — Fa. Corollary B.2 implies 


5, = Aandd; = } @ ~a—J/G+ay —4aA(no + m)) thus EE T 


Hence, using Lemma B.3, 


max |ð; — yal < ata. 


Therefore, we have 


>, Ibil- lb; — Sil 


EE at) “Sa 


(B.3) 


The term ee can be bounded as follow. Let Z = {i € [7]: b; Æ 0}. Then 


Do leil 162 — bil < max |b; = bil- D [Pil 


i icT 


Combining the Cauchy-Schwarz inequality 


li-l = 2[(Qi- n| < AQ- Qila IF, 


with the Davis-Kahan theorem (Yu et al., 2015) 


23/2 |A — EAI 
min {6; — 6-1, Õj+1 — di} 


| Qi- Qi, < = 


Isil = ~v (40 + 1)”, and the concentration of A towards EA, yields 


SET yay 
minjez {6 — 0j-1, 0;41 — ĝ;} 


max |b; — 6;| < 
iE 
Using Lemma B.4, we see that Z = {i € [n] : ô; ¢ {0, t] }}. Combining it with
Corollary B.2, gives 


min {ð; — 6-1, 041 — 6;} = A+ tf 


ata ar 
= 5 (i-i-a) 


TEE 
Z agan 11), 


where we used y1 — x < 1 — x/2. Therefore, 


: = 1 ata 
Dl Fil [ae — 2 < 23/2 | - Dall, 


By noticing that 9,4? > (3 |4i|)° > l Xb] => vae E; l:l 
where we used $; > Jn? fi (Lemma B.4), we have 


|B. IA. — B- 5/2 a 
Rill Wiel, __2 ED 
Èi (m — no) +o) ad 


Going back to inequality (B.3), this implies that 


E 27 (a + 4) ) va 
a (1+ J2./m F nom — o)a2A 


Lemma B.4. Let —EAt + AP = QAQ’, where A = diag (O1, he , Ôn) and 


QTQ = I„. Denote b = 1Q's. We have b) > „n-en n . Moreover, b; = 0 
if 6; = 0 or if 0; = —t7 


Proof. First, from Corollary B.2, 


a 1 2 
i = = -1(a-a+ (2+) = 4ā4 0n +). 


By symmetry, the i-th component of the first eigenvector Q.ı (associated with 1) 
is equal to 


vi Z; ifie [£], 
vo Z; if z g [£], 
where vı and vo are to be determined. Thus, the equation (—EA,; + AP) Qi =
61 Q. leads to 


a ((m + novi + (1 — m1 — 90)v0) = —t} vo 
a ((qı + novi + (1 — 41 — Yo)v0) + åvı = —1F 1, 


which, given the norm constraint ||v||2 = 1, yields 


1 it 
Chose ee ee aed ad ee 
Jn Jonta "+m =n) (+4) 
y 1 + +A 
0 = : 
Jn Jont) + =m —no)(f +4)? 


Since b; = Av’ 3 = A(nı — Yo)nv1, we have 
by E 


— = ÀA = No) -mmo 
Jon + n) (¢) +- m- 0) (+A) 


af ft 
The proof ends by noticing that is > a and al < a. Indeed, 


a 
Tan + no)a* + (1 — m1 — no)(a + 2)? 
AC — No) a 
= 2 
@+a on +m) (4) Thi 1G 


2 A(mi— yo) a 
= 2 A+a 


V 


> Am-1 


5" 
i Is 


This proves the first claim of the lemma. 

Similarly, by symmetry the i-th component of the eigenvector v’ associated with 
—t, equals v, if i € £, and v/, otherwise, and therefore (v’) 7s = 0. 

Finally, let Jọ := {i € [n] : 6; = 0}. By Corollary B.2, we have |Zo] = n(1 — 
nı — No) — 2. Since 0 is also eigenvalue of order m(1 — no — 41) — 2 of the extracted 
sub-matrix (-EA; + AP)„u = (-EAr) uy. we have for all k € Jo, Q = 0 for 


every 7 € [n]. Hence, for k € Ip, bg = AQis = 0. 


B.2 Mean-field Solution of the Constrained Linear 
System (5.17) 


In this section, we calculate the solution x to the mean-field model and deduce 
from it the conditions to recover the clusters. 
Proposition B.3. Suppose that $\tau > p_{\mathrm{out}}$. Then, the solution of equation (5.18) on
the mean-field DC-SBM is the vector x, whose element x; is given by 


C(-1+ (qı — 10)aB) Z;, ifi e € ands; £ Zi, 
x; = į C(1 + (m — 1o)aB) Z; ifi € € ands; = Z;, 
anon — no) A+ (mı + no)aB) Zi, ifig t, 


where & = 3 (Pin — pou) B = Tam Oa 07d C S y 
Proof. Let x be a solution of equation (5.18). By symmetry, we have 

Xi Zi ifje [£] and S; = Zi 
a xf Zis ifz € [£] ands; = —Z;, 

xo Zi, ifż ¢ [£], 


where x;, xp and xo are unknowns to be determined. Since for every i € [7], 


(EA,x); = &@ (xo(1 — m — m0) + xq + xfho)» 


the linear system composed of the equations ( (—EA, + AP — yun) x); = As; 
for all 7 € [7] leads to the system 


—a ((1 — m — no)xo + xem + xpo) — Faxo = 0, 
—a (a — Mı — No)xo + xı + x¢no) — Vax, + Ax, =A, 
=o (a — m1 — No)xo + xı + xp No) _ VaXf + Axf = — 


The rows of the latter system correspond to a node unlabeled by the oracle, correctly 
labeled and falsely labeled, respectively. This system can be rewritten as follows: 


so = apog e Eney)» 
Faxo + xA — Fs) =A, 
VxX0 + xp(A = Vx) = = 


In particular, we have x, — xf = 7 By subsequently eliminating xo and x; in 


the equation y.x9 + xp — 74) = —A, we find 


ene (-1+ ax (M1 — No) ). 
ae 2a — m — no) + Afs — Je(@ + Ju) 


_— (+z ays (m= 10) =). 
and finally
—a 
atl = qı — Ho) + y» 
a (1+ ays (m + 10) ). 
A— Ve Aa(l — 41 — No) + As — s(a + 7x) 


xo = 


Corollary B.5. Suppose that t > Pour. Then sign (xj) = sign (Z;) if 


© node i is not labeled by the oracle; 
© node i is correctly labeled by the oracle; 
© node i is mislabeled by the oracle and i < (1 — 2n9)a4 


=e 
Proof. A node ż is correctly classified by decision rule (5.16) if the sign of x; is equal 
to the sign of Z;. Using Lemma B.3 in Appendix B.1.2, we have —& < yx. < 
—a a — ei Therefore, the quantities B and C in Proposition B.3 verify C > 0 
and + May me <B< NEEN The statement then follows from the expression of 
x; computed in Proposition B.3. 